Abstract

We introduce and analyze stochastic optimization methods where the input to each update
is perturbed by bounded noise. We show that this framework forms the basis of a unified
approach to analyze asynchronous implementations of stochastic optimization algorithms, by
viewing them as serial methods operating on noisy inputs. Using our perturbed iterate framework,
we provide new analyses of the Hogwild! algorithm and asynchronous stochastic coordinate
descent that are simpler than earlier analyses, remove many assumptions of previous
models, and in some cases yield improved upper bounds on the convergence rates. We proceed
to apply our framework to develop and analyze KroMagnon: a novel, parallel, sparse stochastic
variance-reduced gradient (SVRG) algorithm. We demonstrate experimentally on a 16-core
machine that the sparse and parallel version of SVRG is in some cases more than four orders of
magnitude faster than the standard SVRG algorithm.

Keywords: stochastic optimization, asynchronous algorithms, parallel machine learning. 


1 Introduction 

Asynchronous parallel stochastic optimization algorithms have recently gained significant traction
in algorithmic machine learning. A large body of recent work has demonstrated that near-linear
speedups are achievable, in theory and practice, on many common machine learning tasks [1-8].
Moreover, when these lock-free algorithms are applied to non-convex optimization, significant
speedups are still achieved with no loss of statistical accuracy. This behavior has been demonstrated
in practice in state-of-the-art deep learning systems such as Google’s Downpour SGD and
Microsoft’s Project Adam.

Although asynchronous stochastic algorithms are simple to implement and enjoy excellent
performance in practice, they are challenging to analyze theoretically. The current analyses require
lengthy derivations and several assumptions that may not reflect realistic system behavior.
Moreover, due to the difficult nature of the proofs, the algorithms analyzed are often simplified versions
of those actually run in practice.

In this paper, we propose a general framework for deriving convergence rates for parallel,
lock-free, asynchronous first-order stochastic algorithms. We interpret the algorithmic effects of
asynchrony as perturbing the stochastic iterates with bounded noise. This interpretation allows us to
show how a variety of asynchronous first-order algorithms can be analyzed as their serial
counterparts operating on noisy inputs. The advantage of our framework is that it yields elementary
convergence proofs, can remove or relax simplifying assumptions adopted in prior art, and can yield
improved bounds when compared to earlier work.

We demonstrate the general applicability of our framework by providing new convergence
analyses for Hogwild!, i.e., the asynchronous stochastic gradient method (SGM), for asynchronous
stochastic coordinate descent (ASCD), and KroMagnon: a novel asynchronous sparse version of
the stochastic variance-reduced gradient (SVRG) method [11]. In particular, we provide a modified
version of SVRG that allows for sparse updates, we show that this method can be parallelized in the
asynchronous model, and we provide convergence guarantees using our framework. Experimentally,
the asynchronous, parallel sparse SVRG achieves nearly-linear speedups on a machine with 16 cores
and is sometimes four orders of magnitude faster than the standard (dense) SVRG method.

1.1 Related work 

The algorithmic tapestry of parallel stochastic optimization is rich and diverse, extending back
at least to the late 60s. Much of the contemporary work in this space is built upon the
foundational work of Bertsekas, Tsitsiklis et al.; the shared memory access model that
we use in this work is very similar to the partially asynchronous model introduced in the
aforementioned manuscripts. Recent advances in parallel and distributed computing technologies
have generated renewed interest in the theoretical understanding and practical implementation of
parallel stochastic algorithms [15-20].

The power of lock-free, asynchronous stochastic optimization on shared-memory multicore
systems was first demonstrated in the work of [1]. The authors introduce Hogwild!, a completely
lock-free and asynchronous parallel stochastic gradient method (SGM) that exhibits nearly linear
speedups for a variety of machine learning tasks. Inspired by Hogwild!, several authors
developed lock-free and asynchronous algorithms that move beyond SGM, such as the work of Liu et al.
on parallel stochastic coordinate descent. Additional work in first order optimization and
beyond [6-8, 22, 23], extending to parallel iterative linear solvers, has further shown that
linear speedups are possible in the asynchronous shared memory model.

2 Perturbed Stochastic Gradients 

Preliminaries and Notation We study parallel asynchronous iterative algorithms that minimize
convex functions $f(x)$ with $x \in \mathbb{R}^d$. The computational model is the same as that of Niu et
al. [1]: a number of cores have access to the same shared memory, and each of them can read and
update components of $x$ in the shared memory. The algorithms that we consider are asynchronous
and lock-free: cores do not coordinate their reads or writes, and while a core is reading/writing
other cores can update the shared variables in $x$.

We focus our analysis on functions $f$ that are $L$-smooth and $m$-strongly convex. A function $f$
is $L$-smooth if it is differentiable and has Lipschitz gradients

$$\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\| \quad \text{for all } x, y \in \mathbb{R}^d,$$

where $\|\cdot\|$ denotes the Euclidean norm. Strong convexity with parameter $m > 0$ imposes a curvature
condition on $f$:

$$f(x) \ge f(y) + \langle \nabla f(y), x - y \rangle + \frac{m}{2} \|x - y\|^2 \quad \text{for all } x, y \in \mathbb{R}^d.$$

Strong convexity implies that $f$ has a unique minimum $x^*$ and satisfies

$$\langle \nabla f(x) - \nabla f(y), x - y \rangle \ge m \|x - y\|^2.$$

In the following, we use $i$, $j$, and $k$ to denote iteration counters, while reserving $v$ and $u$ to denote
coordinate indices. We use $O(1)$ to denote absolute constants.
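
As a quick sanity check of these definitions, the following sketch verifies the three inequalities numerically. It is an illustration only: it assumes a diagonal quadratic $f(x) = \frac{1}{2}\sum_v c_v x_v^2$, for which $L = \max_v c_v$ and $m = \min_v c_v$, and the helper names and the vector `c` are hypothetical choices.

```python
import random

def f(c, x):
    # Assumed test function: f(x) = 0.5 * sum_v c[v] * x[v]^2
    return 0.5 * sum(cv * xv * xv for cv, xv in zip(c, x))

def grad(c, x):
    return [cv * xv for cv, xv in zip(c, x)]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def sub(a, b):
    return [p - q for p, q in zip(a, b)]

def check_pair(c, x, y, tol=1e-9):
    """Check L-smoothness and both strong convexity inequalities at (x, y)."""
    L, m = max(c), min(c)
    gx, gy = grad(c, x), grad(c, y)
    dist_sq = dot(sub(x, y), sub(x, y))
    smooth = dot(sub(gx, gy), sub(gx, gy)) ** 0.5 <= L * dist_sq ** 0.5 + tol
    strong = f(c, x) >= f(c, y) + dot(gy, sub(x, y)) + 0.5 * m * dist_sq - tol
    monotone = dot(sub(gx, gy), sub(x, y)) >= m * dist_sq - tol
    return smooth and strong and monotone

rng = random.Random(0)
c = [1.0, 2.0, 5.0]
all_ok = all(
    check_pair(c, [rng.uniform(-3, 3) for _ in c], [rng.uniform(-3, 3) for _ in c])
    for _ in range(1000)
)
```

For this quadratic all three inequalities hold at every pair of points, so `all_ok` is true regardless of the random draws.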

Perturbed Iterates A popular way to minimize convex functions is via first-order stochastic
algorithms. These algorithms can be described using the following general iterative expression:

$$x_{j+1} = x_j - \gamma\, g(x_j, \xi_j), \tag{2.1}$$

where $\xi_j$ is a random variable independent of $x_j$ and $g$ is an unbiased estimator of the true gradient
of $f$ at $x_j$: $\mathbb{E}_{\xi_j} g(x_j, \xi_j) = \nabla f(x_j)$. The success of first-order stochastic techniques partly lies
in their computational efficiency: the small computational cost of using noisy gradient estimates
trumps the gains of using true gradients.

A major advantage of the iterative formula in (2.1) is that, in combination with strong convexity
and smoothness inequalities, one can easily track algorithmic progress and establish convergence
rates to the optimal solution. Unfortunately, the progress of asynchronous parallel algorithms
cannot be precisely described or analyzed using the above iterative framework. Processors do not
read actual iterates $x_j$ from memory, as there is no global clock that synchronizes reads or writes
while different cores write/read “stale” variables.

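In the subsequent sections, we show that the following simple perturbed variant of Eq. (2.1)
can capture the algorithmic progress of asynchronous stochastic algorithms. Consider the following
iteration

$$x_{j+1} = x_j - \gamma\, g(x_j + n_j, \xi_j), \tag{2.2}$$

where $n_j$ is a stochastic error term. For simplicity let $\hat{x}_j = x_j + n_j$. Then,

$$\begin{aligned}
\|x_{j+1} - x^*\|^2 &= \|x_j - \gamma g(\hat{x}_j, \xi_j) - x^*\|^2 \\
&= \|x_j - x^*\|^2 - 2\gamma \langle x_j - x^*, g(\hat{x}_j, \xi_j) \rangle + \gamma^2 \|g(\hat{x}_j, \xi_j)\|^2 \\
&= \|x_j - x^*\|^2 - 2\gamma \langle \hat{x}_j - x^*, g(\hat{x}_j, \xi_j) \rangle + \gamma^2 \|g(\hat{x}_j, \xi_j)\|^2 + 2\gamma \langle \hat{x}_j - x_j, g(\hat{x}_j, \xi_j) \rangle,
\end{aligned} \tag{2.3}$$

where in the last equation we added and subtracted the term $2\gamma \langle \hat{x}_j, g(\hat{x}_j, \xi_j) \rangle$.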

We assume that $\hat{x}_j$ and $\xi_j$ are independent. However, in contrast to recursion (2.1), we no
longer require $x_j$ to be independent of $\xi_j$. The importance of the above independence assumption
will become clear in the next section.

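We now take the expectation of both sides in (2.3). Since $\hat{x}_j$ and $x^*$ are independent of $\xi_j$, we
use iterated expectations to obtain $\mathbb{E}\langle \hat{x}_j - x^*, g(\hat{x}_j, \xi_j) \rangle = \mathbb{E}\langle \hat{x}_j - x^*, \nabla f(\hat{x}_j) \rangle$. Moreover, since $f$
is $m$-strongly convex, we know that

$$\langle \hat{x}_j - x^*, \nabla f(\hat{x}_j) \rangle \ge m \|\hat{x}_j - x^*\|^2 \ge \frac{m}{2} \|x_j - x^*\|^2 - m \|\hat{x}_j - x_j\|^2, \tag{2.4}$$

where the second inequality is a simple consequence of the triangle inequality. Now, let $a_j =
\mathbb{E}\|x_j - x^*\|^2$ and substitute (2.4) back into Eq. (2.3) to get

$$a_{j+1} \le (1 - \gamma m) a_j + \gamma^2 \underbrace{\mathbb{E}\|g(\hat{x}_j, \xi_j)\|^2}_{R_0^j} + 2\gamma m \underbrace{\mathbb{E}\|\hat{x}_j - x_j\|^2}_{R_1^j} + 2\gamma \underbrace{\mathbb{E}\langle \hat{x}_j - x_j, g(\hat{x}_j, \xi_j) \rangle}_{R_2^j}. \tag{2.5}$$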

The recursive equation (2.5) is key to our analysis. We show that for given $R_0^j$, $R_1^j$, and $R_2^j$, we
can obtain convergence rates through elementary algebraic manipulations. Observe that there are
three “error” terms in (2.5): $R_0^j$ captures the stochastic gradient decay with each iteration, $R_1^j$
captures the mismatch between the true iterate and its noisy estimate, and $R_2^j$ measures the size
of the projection of that mismatch on the gradient at each step. The key contribution of our work
is to show that 1) this iteration can capture the algorithmic progress of asynchronous algorithms,
and 2) the error terms can be bounded to obtain an $O(\log(1/\epsilon)/\epsilon)$ rate for Hogwild!, and linear
rates of convergence for asynchronous SCD and asynchronous sparse SVRG.
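
To make the perturbed iteration (2.2) concrete, the sketch below runs it on an assumed one-dimensional problem, $f(x) = x^2/2$ with $x^* = 0$, and estimates $a_T = \mathbb{E}|x_T - x^*|^2$ by averaging over independent runs. The uniform noise models, the step size, and the function names are illustrative assumptions, not part of the analysis above.

```python
import random

def perturbed_sgd(x0, gamma, steps, noise_bound, seed):
    """Run x_{j+1} = x_j - gamma * g(x_j + n_j, xi_j) for f(x) = x^2 / 2."""
    rng = random.Random(seed)
    x = x0
    for _ in range(steps):
        n = rng.uniform(-noise_bound, noise_bound)  # bounded perturbation n_j
        xi = rng.uniform(-1.0, 1.0)                 # zero-mean gradient noise xi_j
        x_hat = x + n                               # perturbed input to the update
        g = x_hat + xi                              # unbiased for grad f(x_hat) = x_hat
        x -= gamma * g
    return x

def mean_squared_error(noise_bound, runs=200, gamma=0.05, steps=500):
    """Average |x_T - x*|^2 over independent runs, with x* = 0."""
    finals = [perturbed_sgd(5.0, gamma, steps, noise_bound, seed) for seed in range(runs)]
    return sum(x * x for x in finals) / runs

mse_exact_input = mean_squared_error(noise_bound=0.0)
mse_noisy_input = mean_squared_error(noise_bound=2.0)
```

In this toy setting both runs settle in a small neighborhood of $x^*$, and the neighborhood grows with the size of the perturbation, which is the role played by the error terms $R_1^j$ and $R_2^j$ in (2.5).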

Figure 1: The bipartite graph on the left has as its leftmost vertices the $n$ function terms and as its
rightmost vertices the coordinates of $x$. A term $f_{e_i}$ is connected to a coordinate $x_j$ if hyperedge $e_i$ contains
$j$ (i.e., if the $i$-th term is a function of that coordinate). The graph on the right depicts a conflict graph
between the function terms. The vertices denote the function terms, and two terms are joined by an edge if
they conflict on at least one coordinate in the bipartite graph.


3 Analyzing Hogwild! 

In this section, we provide a simple analysis of Hogwild!, the asynchronous implementation of
SGM. We focus on functions $f$ that are decomposable into $n$ terms:

$$f(x) = \frac{1}{n} \sum_{i=1}^{n} f_{e_i}(x), \tag{3.1}$$

where $x \in \mathbb{R}^d$, and each $f_{e_i}(x)$ depends only on the coordinates indexed by the subset $e_i$ of
$\{1, 2, \ldots, d\}$. For simplicity we assume that the terms of $f$ are differentiable; our results can be
readily extended to non-differentiable $f_{e_i}$'s.

We refer to the sets $e_i$ as hyperedges and denote the set of hyperedges by $\mathcal{E}$. We sometimes refer
to the $f_{e_i}$'s as the terms of $f$. As shown in Fig. 1, the hyperedges induce a bipartite graph between
the $n$ terms and the $d$ variables in $x$, and a conflict graph between the $n$ terms. Let $\overline{\Delta}_C$ be the
average degree in the conflict graph; that is, the average number of terms that are in conflict with
a single term. We assume that $\overline{\Delta}_C \ge 1$, otherwise we could decompose the problem into smaller
independent sub-problems. As we will see, under our perturbed iterate analysis framework the
convergence rate of asynchronous algorithms depends on $\overline{\Delta}_C$.
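
The average conflict degree can be computed directly from the hyperedges. The following is a minimal sketch with a quadratic-time pairwise scan; the function names and the small example are hypothetical.

```python
from itertools import combinations

def conflict_degrees(hyperedges):
    """Return, for every term, the number of other terms it conflicts with."""
    sets = [set(e) for e in hyperedges]
    degrees = [0] * len(sets)
    for i, k in combinations(range(len(sets)), 2):
        if sets[i] & sets[k]:          # the two terms share at least one coordinate
            degrees[i] += 1
            degrees[k] += 1
    return degrees

def average_conflict_degree(hyperedges):
    degrees = conflict_degrees(hyperedges)
    return sum(degrees) / len(degrees)

# Four terms over five coordinates; the last term conflicts with no other term.
example_edges = [(0, 1), (1, 2), (2, 3), (4,)]
example_degrees = conflict_degrees(example_edges)
example_average = average_conflict_degree(example_edges)
```

In this example the last term shares no coordinate with the others, so it could be split off as an independent sub-problem, as noted above.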

Hogwild! (Alg. 1) is a method to parallelize SGM in the asynchronous setting [1]. It is
deployed on multiple cores that have access to shared memory, where the optimization variable $x$
and the data points that define the $f$ terms are stored. During its execution each core samples
uniformly at random a hyperedge $s$ from $\mathcal{E}$. It reads the coordinates $v \in s$ of the shared vector $x$,
evaluates $\nabla f_s$ at the point read, and finally adds $-\gamma \nabla f_s$ to the shared variable.


During the execution of Hogwild! cores do not synchronize or follow an order between reads 
or writes. Moreover, they access (i.e., read or write) a set of coordinates in x without the use of any 
locking mechanisms that would ensure a conflict-free execution. This implies that the reads/writes 
of distinct cores can intertwine in arbitrary ways, e.g., while a core updates a subset of variables, 
before completing its task, other cores can read/write the same subset of variables. 

In [1], the authors analyzed a variant of Hogwild! in which several simplifying assumptions
were made. Specifically, in [1]: 1) only a single coordinate per sampled hyperedge is updated (i.e.,
the for loop in Hogwild! is replaced with a single coordinate update); 2) the authors assumed
consistent reads, i.e., it was assumed that while a core is reading the shared variable, no writes from

Algorithm 1 Hogwild!
1: while number of sampled hyperedges < T do in parallel
2:     sample a random hyperedge $s$
3:     $[\hat{x}]_s$ = an inconsistent read of the shared variable $[x]_s$
4:     $[u]_s = -\gamma \cdot g([\hat{x}]_s, s)$
5:     for $v \in s$ do
6:         $[x]_v = [x]_v + [u]_v$    // atomic write
7:     end for
8: end while

other cores occur; 3) the authors make an implicit assumption on the uniformity of the processing
times of cores (explained in the following) that does not generically hold in practice. These
simplifications alleviated some of the challenges in analyzing Hogwild! and allowed the authors to
provide a convergence result. As we show in the current paper, however, these simplifications are
not necessary to obtain a convergence analysis. Our perturbed iterates framework can be used in an
elementary way to analyze the original version of Hogwild!, yielding improved bounds compared
to earlier analyses.
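
The full-update loop of Alg. 1 can be sketched with threads that share a plain list and take no locks. This is a toy illustration under stated assumptions: the terms are $f_{e_i}(x) = \frac{1}{2}\sum_{v \in e_i}(x_v - t_v)^2$ for a hypothetical target vector $t$, and the in-place addition on a Python list only stands in for an atomic write (it is not atomic in CPython).

```python
import random
import threading

def hogwild(hyperedges, target, gamma, total_samples, num_threads=4, seed=0):
    """Lock-free SGM on f(x) = (1/n) sum_i 0.5 * sum_{v in e_i} (x_v - target_v)^2."""
    x = [0.0] * len(target)            # shared variable; no locks are taken
    per_thread = total_samples // num_threads

    def worker(thread_id):
        rng = random.Random(seed + thread_id)
        for _ in range(per_thread):
            s = rng.choice(hyperedges)                           # line 2: sample a hyperedge
            x_hat = {v: x[v] for v in s}                         # line 3: possibly inconsistent read
            u = {v: -gamma * (x_hat[v] - target[v]) for v in s}  # line 4: scaled gradient step
            for v in s:                                          # lines 5-7: coordinate-wise writes
                x[v] += u[v]   # stand-in for an atomic add (not atomic in CPython)

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x

d = 10
target = [v / d for v in range(d)]
hyperedges = [(v, (v + 1) % d) for v in range(d)]   # each term touches two coordinates
x_final = hogwild(hyperedges, target, gamma=0.1, total_samples=20000)
max_error = max(abs(a - b) for a, b in zip(x_final, target))
```

Because the gradient of every term vanishes at the minimizer in this toy problem, the iterates converge to the target even with interleaved reads and writes; with noisy gradients one would instead expect convergence to a neighborhood whose size depends on the step size.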

3.1 Ordering the samples 

A subtle but important point in the analysis of Hogwild! is the need to define an order for the 
sampled hyperedges. A key point of difference of our work is that we order the samples based on 
the order in which they were sampled, not the order in which cores complete the processing of the 
samples. 

Definition 1. We denote by $s_i$ the $i$-th sampled hyperedge in a run of Alg. 1.

That is, $s_i$ denotes the sample obtained when line 2 in Alg. 1 is executed for the $i$-th time.
This is different from the original work of [1], in which the samples were ordered according to
the completion time of each thread. The issue with such an ordering is that the distribution
of the samples, conditioned on the ordering, is not always uniform; for example, hyperedges of
small cardinality are more likely to be “early” samples. A uniform distribution is needed for the
theoretical analysis of stochastic gradient methods, a point that is disregarded in [1]. Our ordering
according to sampling time resolves this issue by guaranteeing uniformity among samples in a trivial
way.
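
The bias introduced by ordering samples by completion time can be seen in a small simulation. The sketch assumes, purely for illustration, that every core draws a hyperedge of size 1 or 10 with equal probability at time zero and that processing time is proportional to the size.

```python
import random

def first_completed_is_small(num_cores, rng):
    """Sample one hyperedge size per core; report whether the earliest finisher is small."""
    sizes = [rng.choice([1, 10]) for _ in range(num_cores)]
    # Processing time is assumed proportional to size, so the smallest size finishes first.
    return min(sizes) == 1

def fraction_small_first(num_cores=4, trials=10000, seed=0):
    rng = random.Random(seed)
    hits = sum(first_completed_is_small(num_cores, rng) for _ in range(trials))
    return hits / trials

small_first_by_completion = fraction_small_first(num_cores=4)
small_first_by_sampling = fraction_small_first(num_cores=1)
```

With four cores the first completed sample is small with probability $1 - 2^{-4} = 15/16$, whereas the first sampled hyperedge is small with probability $1/2$, so only the sampling order preserves uniformity.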

3.2 Defining read iterates and clarifying independence assumptions

Since the shared memory variable can change inconsistently during reads and writes, we also have 
to be careful about the notion of iterates in Hogwild!. 

Definition 2. We denote by $x_i$ the contents of the shared memory before the $i$-th execution of line
2. Moreover, we denote by $\hat{x}_i \in \mathbb{R}^d$ the vector that in coordinates $v \in s_i$ contains exactly what
the core that sampled $s_i$ read. We then define $[\hat{x}_i]_v = [x_i]_v$ for all $v \notin s_i$. Note that we do not
assume consistent reads, i.e., the contents of the shared memory can potentially change while a core
is reading.

At this point we would like to briefly discuss an independence assumption held by all prior
work. In the following paragraph, we explain why this assumption is not always true in practice.
In the appendix we show how to lift this independence assumption, but for ease of exposition we
do adopt it in our main text.

Assumption 1. The vector $\hat{x}_i$ is independent of the sampled hyperedge $s_i$.

The above independence assumption is important when establishing the convergence rate of
the algorithm, and has been held explicitly or implicitly in prior work. Specifically,
when proving convergence rates for these algorithms we need to show via iterated expectations
that $\mathbb{E}\langle \hat{x}_i - x^*, g(\hat{x}_i, s_i) \rangle = \mathbb{E}\langle \hat{x}_i - x^*, \nabla f(\hat{x}_i) \rangle$, which follows from the independence of $\hat{x}_i$ and $s_i$.
However, observe that although $x_i$ is independent of $s_i$ by construction, this is not the case for the
vector $\hat{x}_i$ read by the core that sampled $s_i$. For example, consider the scenario of two consecutively
sampled hyperedges in Alg. 1 that overlap on a subset of coordinates. Then, say one core is reading
the coordinates of the shared variables indexed by its hyperedge, while the second core is updating
a subset of these coordinates. In this case, the values read by the first core depend on the support
of the sampled hyperedge.

One way to rigorously enforce the independence of $\hat{x}_i$ and $s_i$ is to require the processors to
read the entire shared variable $x$ before sampling a new hyperedge. However, this might not be
reasonable in practice, as the dimension of $x$ tends to be considerably larger than the sparsity of
the hyperedges. As we mentioned earlier, in the appendix we show how to overcome the issue of
dependence and thereby remove Assumption 1; however, this results in a slightly more cumbersome
analysis. To ease readability, in our main text we do adopt Assumption 1.

3.3 The perturbed iterates view of asynchrony 

In this work, we assume that all writes are atomic, in the sense that they will be successfully
recorded in the shared memory at some point. Atomicity is a reasonable assumption in practice,
as it can be strictly enforced through compare-and-swap operations [1].

Assumption 2. Every write in line 6 of Alg. 1 will complete successfully.

This assumption implies that all writes will appear in the shared memory by the end of the
execution, in the form of coordinate-wise updates. Due to commutativity the order in which these
updates are recorded in the shared memory is irrelevant. Hence, after processing a total of $T$
hyperedges the shared memory contains:

$$\underbrace{\overbrace{x_0 - \gamma g(\hat{x}_0, s_0)}^{x_1} - \cdots - \gamma g(\hat{x}_{T-1}, s_{T-1})}_{x_T}, \tag{3.2}$$

where $x_0$ is the initial guess and $x_i$ is defined as the vector that contains all gradient updates up
to sample $s_{i-1}$.

Remark 1. Throughout this section we denote $g(x, s_i) = \nabla f_{s_i}(x)$, which we assume to be bounded:
$\|g(x, s)\| \le M$. Such a uniform bound on the norm of the stochastic gradient is true when operating
on a bounded $\ell_\infty$ ball; this can in turn be enforced by a simple, coordinate-wise thresholding operator.
We can refine our analysis by avoiding the uniform bound on $\|g(x, s)\|$, through a simple application
of the co-coercivity lemma: in this case, our derivations would only require a
uniform bound on $\|g(x^*, s)\|$. Our subsequent derivations can be adapted to the above; however, to
keep our derivations elementary we will use the uniform bound on $\|g(x, s)\|$.

Remark 2. Observe that although a core is only reading the subset of variables that are indexed by
its sampled hyperedge, in (3.2) we use the entire vector $\hat{x}$ as the input to the sampled gradient. We
can do this since $g(\hat{x}_k, s_k)$ is independent of the coordinates of $\hat{x}_k$ outside the support of hyperedge
$s_k$.

Using the above definitions, we define the perturbed iterates of Hogwild! as

$$x_{i+1} = x_i - \gamma\, g(\hat{x}_i, s_i), \tag{3.3}$$

for $i = 0, 1, \ldots, T - 1$, where $s_i$ is the $i$-th uniformly sampled hyperedge. Observe that all but the
first and last of these iterates are “fake”: there might not be an actual time when they exist in the
shared memory during the execution. However, $x_0$ is what is stored in memory before the execution
starts, and $x_T$ is exactly what is stored in shared memory at the end of the execution.

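We observe that the iterates in (3.3) place Hogwild! in the perturbed gradient framework
introduced in § 2:

$$a_{j+1} \le (1 - \gamma m) a_j + \gamma^2 \underbrace{\mathbb{E}\|g(\hat{x}_j, s_j)\|^2}_{R_0^j} + 2\gamma m \underbrace{\mathbb{E}\|\hat{x}_j - x_j\|^2}_{R_1^j} + 2\gamma \underbrace{\mathbb{E}\langle \hat{x}_j - x_j, g(\hat{x}_j, s_j) \rangle}_{R_2^j}.$$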

We are only left to bound the three error terms $R_0^j$, $R_1^j$, and $R_2^j$. Before we proceed, we note
that for the technical soundness of our theorems, we have to also define a random variable that
captures the system randomness. In particular, let $\xi$ denote a random variable that encodes the
randomness of the system (i.e., random delays between reads and writes, gradient computation
time, etc.). Although we do not explicitly use $\xi$, its distribution is required implicitly to compute the
expectations for the convergence analysis. This is because the random samples $s_0, s_1, \ldots, s_{T-1}$ do
not fully determine the output of Alg. 1. However, $s_0, \ldots, s_{T-1}$ along with $\xi$ completely determine
the time of all reads and writes. We continue with our final assumption needed by our analysis,
which is also needed by prior art.

Assumption 3 (Bounded overlaps). Two hyperedges $s_i$ and $s_j$ overlap in time if they are processed
concurrently at some point during the execution of Hogwild!. The time during which a hyperedge
$s_i$ is being processed begins when the sampling function is called and ends after the last coordinate
of $g(\hat{x}_i, s_i)$ is written to the shared memory. We assume that there exists a number $\tau \ge 0$, such
that the maximum number of sampled hyperedges that can overlap in time with a particular sampled
hyperedge cannot be more than $\tau$.

The usefulness of the above assumption is that it essentially abstracts away all system details
relative to delays, processing overlaps, and number of cores into a single parameter. Intuitively,
$\tau$ can be perceived as a proxy for the number of cores, i.e., we would expect that no more than
roughly $O(\#\text{cores})$ sampled hyperedges overlap in time with a single hyperedge, assuming that the
processing times across the samples are approximately similar. Observe that if $\tau$ is small, then
we expect the distance between $x_j$ and the noisy iterate $\hat{x}_j$ to be small. In our perturbed iterate
framework, if we set $\tau = 0$, then we obtain the classical iterative formula of serial SGM.

To quantify the distance between $\hat{x}_j$ (i.e., the iterate read by the core that sampled $s_j$) and $x_j$
(i.e., the “fake” iterate used to establish convergence rates), we observe that any difference between
them is caused solely by hyperedges that overlap with $s_j$ in time. To see this, let $s_i$ be an “earlier”
sample, i.e., $i < j$, that does not overlap with $s_j$ in time. This implies that the processing of $s_i$
finishes before $s_j$ starts being processed. Hence, the full contribution of $\gamma g(\hat{x}_i, s_i)$ will be recorded
in both $\hat{x}_j$ and $x_j$ (for the latter this holds by definition). Similarly, if $i > j$ and $s_i$ does not
overlap with $s_j$ in time, then neither $\hat{x}_j$ nor $x_j$ (for the latter, again by definition) contain any of
the coordinate updates involved in the gradient update $\gamma g(\hat{x}_i, s_i)$. Assumption 3 ensures that if
$i < j - \tau$ or $i > j + \tau$, the sample $s_i$ does not overlap in time with $s_j$.

By the above discussion, and due to Assumption 3, there exist diagonal matrices $S_i^j$ with
diagonal entries in $\{-1, 0, 1\}$ such that

$$\hat{x}_j - x_j = \sum_{i = j - \tau,\, i \ne j}^{j + \tau} \gamma S_i^j g(\hat{x}_i, s_i). \tag{3.4}$$


These diagonal matrices account for any possible pattern of (potentially) partial updates that
can occur while hyperedge $s_j$ is being processed. We would like to note that the above notation
bears resemblance to the coordinate-update mismatch formulation used in prior analyses of
asynchronous coordinate-based algorithms.
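
The decomposition in (3.4) suggests a serial simulation of asynchrony: keep the “fake” iterate $x_j$ exactly, and build the read iterate $\hat{x}_j$ by hiding a random subset of the coordinate updates from the last $\tau$ samples. The sketch below does this for an assumed separable toy problem with terms $\frac{1}{2}\sum_{v \in e_i}(x_v - 1)^2$; the hiding probability of one half, the step size, and the function name are illustrative assumptions, and only earlier samples ($i < j$) are modeled.

```python
import random

def stale_read_sgm(tau, steps=4000, gamma=0.1, d=10, seed=0):
    """Serial SGM in which each read misses some updates of the last tau samples."""
    rng = random.Random(seed)
    target = [1.0] * d
    edges = [(v, (v + 1) % d) for v in range(d)]
    x = [0.0] * d      # serial "fake" iterate x_j
    recent = []        # updates gamma * g(x_hat_i, s_i) of the last tau samples
    for _ in range(steps):
        s = rng.choice(edges)
        x_hat = list(x)
        for update in recent:
            for v, step in update.items():
                if rng.random() < 0.5:   # this coordinate write is not visible yet
                    x_hat[v] += step     # diagonal entry of S_i^j equal to 1
        update = {v: gamma * (x_hat[v] - target[v]) for v in s}
        for v, step in update.items():
            x[v] -= step                 # x_{j+1} = x_j - gamma * g(x_hat_j, s_j)
        recent = (recent + [update])[-tau:] if tau > 0 else []
    return max(abs(a - b) for a, b in zip(x, target))

serial_error = stale_read_sgm(tau=0)
stale_error = stale_read_sgm(tau=8)
```

Setting `tau=0` recovers serial SGM. In this noiseless toy problem both variants reach the minimizer, which illustrates how a small mismatch between $\hat{x}_j$ and $x_j$ leaves the serial behavior essentially intact.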

We now turn to the convergence proof, emphasizing its elementary nature within the perturbed
iterate analysis framework. We begin by bounding the error terms $R_1^j$ and $R_2^j$ ($R_0^j$ is already
assumed to be at most $M^2$).

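Lemma 3. Hogwild! satisfies the recursion in (2.5) with

$$R_1^j = \mathbb{E}\|\hat{x}_j - x_j\|^2 \le \gamma^2 M^2 \left( 2\tau + 8\tau^2 \frac{\overline{\Delta}_C}{n} \right) \quad \text{and} \quad R_2^j = \mathbb{E}\langle \hat{x}_j - x_j, g(\hat{x}_j, s_j) \rangle \le 4\gamma M^2 \tau \frac{\overline{\Delta}_C}{n},$$

where $\overline{\Delta}_C$ is the average degree of the conflict graph between the hyperedges.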
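Proof. The norm of the mismatch can be bounded in the following way:

$$\begin{aligned}
R_1^j &= \gamma^2\, \mathbb{E}\Big\| \sum_{i = j - \tau,\, i \ne j}^{j + \tau} S_i^j g(\hat{x}_i, s_i) \Big\|^2 \\
&\le \gamma^2 \sum_{i} \mathbb{E}\|S_i^j g(\hat{x}_i, s_i)\|^2 + \gamma^2 \sum_{i, k:\, i \ne k} \mathbb{E}\langle S_i^j g(\hat{x}_i, s_i), S_k^j g(\hat{x}_k, s_k) \rangle \\
&\le \gamma^2 \sum_{i} \mathbb{E}\|g(\hat{x}_i, s_i)\|^2 + \gamma^2 \sum_{i, k:\, i \ne k} \mathbb{E}\left\{ \|g(\hat{x}_i, s_i)\| \, \|g(\hat{x}_k, s_k)\| \, \mathbb{1}(s_i \cap s_k \ne \emptyset) \right\},
\end{aligned}$$

since $S_i^j$ are diagonal sign matrices and since the steps $g(\hat{x}_i, s_i)$ are supported on the samples $s_i$.
We use the upper bound $\|g(\hat{x}_i, s_i)\| \le M$ to obtain

$$R_1^j \le \gamma^2 \cdot 2\tau \cdot M^2 + \gamma^2 M^2 (2\tau)^2 \Pr(s_i \cap s_j \ne \emptyset) \le \gamma^2 M^2 \left( 2\tau + 8\tau^2 \frac{\overline{\Delta}_C}{n} \right).$$

The last step follows because two sampled hyperedges (sampled with replacement) intersect with
probability at most $2\overline{\Delta}_C / n$. We can bound $R_2^j$ in a similar way:

$$R_2^j = \gamma \sum_{i = j - \tau,\, i \ne j}^{j + \tau} \mathbb{E}\langle S_i^j g(\hat{x}_i, s_i), g(\hat{x}_j, s_j) \rangle \le \gamma M^2 \sum_{i = j - \tau,\, i \ne j}^{j + \tau} \mathbb{E}\{ \mathbb{1}(s_i \cap s_j \ne \emptyset) \} \le 4\gamma M^2 \tau \frac{\overline{\Delta}_C}{n}. \qquad \square$$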


Plugging the bounds of Lemma 3 in our recursive formula, we see that Hogwild! satisfies the
recursion

$$a_{j+1} \le (1 - \gamma m)\, a_j + \gamma^2 M^2 \Big( 1 + \underbrace{\frac{8\tau \overline{\Delta}_C}{n} + 4\gamma m \tau + 16\gamma m \tau^2 \frac{\overline{\Delta}_C}{n}}_{\delta} \Big). \tag{3.5}$$

On the other hand, serial SGM satisfies the recursion $a_{j+1} \le (1 - \gamma m)\, a_j + \gamma^2 M^2$. If the step size
is set to $\gamma = \frac{\epsilon m}{2M^2}$, it attains target accuracy $\epsilon$ in $T \ge \frac{2M^2}{\epsilon m^2} \log\left(\frac{2 a_0}{\epsilon}\right)$ iterations. Hence, when
the term $\delta$ of (3.5) is order-wise constant, Hogwild! satisfies the same recursion (up to constants)
as serial SGM. This directly implies the main result of this section.
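
The serial recursion and the extra term $\delta$ of (3.5) are easy to evaluate numerically. The sketch below assumes the step size $\gamma = \epsilon m / (2M^2)$, chosen so that the fixed point $\gamma M^2 / m$ of the serial recursion equals $\epsilon/2$, and iterates the recursion with equality as a worst case. All numerical values are arbitrary illustrations.

```python
import math

def serial_sgm_plan(eps, m, M, a0):
    """Step size and iteration count for target accuracy eps (assumed choices)."""
    gamma = eps * m / (2 * M ** 2)
    T = math.ceil(2 * M ** 2 / (eps * m ** 2) * math.log(2 * a0 / eps))
    return gamma, T

def run_recursion(a0, gamma, m, M, T, delta=0.0):
    """Iterate a_{j+1} = (1 - gamma*m) a_j + gamma^2 M^2 (1 + delta)."""
    a = a0
    for _ in range(T):
        a = (1 - gamma * m) * a + gamma ** 2 * M ** 2 * (1 + delta)
    return a

def delta_term(gamma, m, tau, n, avg_conflict_degree):
    """The asynchrony term delta of recursion (3.5)."""
    ratio = avg_conflict_degree / n
    return 8 * tau * ratio + 4 * gamma * m * tau + 16 * gamma * m * tau ** 2 * ratio

eps, m, M, a0 = 0.01, 1.0, 1.0, 1.0
gamma, T = serial_sgm_plan(eps, m, M, a0)
a_serial = run_recursion(a0, gamma, m, M, T)
delta = delta_term(gamma, m, tau=2, n=1000, avg_conflict_degree=5)
a_async = run_recursion(a0, gamma, m, M, T, delta=delta)
```

With these values $\delta$ is about $0.12$, so the asynchronous bound `a_async` exceeds the serial one only by a constant factor, in line with the claim that an order-wise constant $\delta$ preserves the serial rate.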


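Theorem 4. If the number of samples that overlap in time with a single sample during the execution
of Hogwild! is bounded as

$$\tau = O\left( \min\left\{ \frac{n}{\overline{\Delta}_C}, \frac{M^2}{\epsilon m^2} \right\} \right),$$

Hogwild!, with step size $\gamma = O(1) \frac{\epsilon m}{M^2}$, reaches an accuracy of $\mathbb{E}\|x_k - x^*\|^2 \le \epsilon$ after

$$T \ge O(1)\, \frac{M^2 \log\left(\frac{a_0}{\epsilon}\right)}{\epsilon m^2}$$

iterations.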


Since the iteration bound in the theorem is (up to a constant) the same as that of serial SGM,
our result implies a linear speedup. We would like to note that an improved rate of $O(1/\epsilon)$ can be
obtained by appropriately diminishing stepsizes per epoch. Furthermore, observe
that although the bound on $\tau$ might seem restrictive, it is, up to a logarithmic factor,
proportional to the total number of iterations required by Hogwild! (or even serial SGM) to
reach $\epsilon$ accuracy. Assuming that the average degree of the conflict graph is constant, and that we
perform a constant number of passes over the data, i.e., $T = c \cdot n$, then $\tau$ can be as large as $\tilde{O}(n)$,
i.e., nearly linear in the number of function terms.¹


3.4 Comparison with the original Hogwild! analysis of [1]

Let us summarize the key points of improvement compared to the original Hogwild! analysis: 

• Our analysis is elementary and compact, and follows simply by bounding the $R_0^j$, $R_1^j$, and $R_2^j$
terms, after introducing the perturbed gradient framework of § 2.

• We do not assume consistent reads: while a core is reading from the shared memory other
cores are allowed to read or write.

• In [1] the authors analyze a simplified version of Hogwild! where for each sampled hyperedge
only a randomly selected coordinate is updated. Here we analyze the “full-update” version
of Hogwild!.

• We order the samples by the order in which they were sampled, not by completion time.
This allows us to rigorously prove our convergence bounds, without assuming anything on the
distribution of the processing time of each hyperedge. This is unlike [1], where there is an
implicit assumption of uniformity with respect to processing times.

• The previous work of [1] establishes a nearly-linear speedup for Hogwild! if $\tau$ is bounded
as $\tau = O\left( \sqrt[4]{n / (\Delta_R \Delta_L^2)} \right)$, where $\Delta_R$ is the maximum right degree of the term-variables bipartite
graph, shown in Fig. 1, and $\Delta_L$ is the maximum left degree of the same graph. Observe that
$\Delta_R \cdot \Delta_L^2 \ge \Delta_L \cdot \Delta_C$, where $\Delta_C$ is the maximum degree of the conflict graph. Here, we obtain
a linear speedup for up to $\tau = O\left( \min\left\{ n / \overline{\Delta}_C,\ M^2 / (\epsilon m^2) \right\} \right)$, where $\overline{\Delta}_C$ is only the average degree
of the conflict graph in Fig. 1. Our bound on the delays can be orders of magnitude better
than that of [1].

¹ $\tilde{O}$ hides logarithmic terms.
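
The graph quantities that enter the two delay bounds can be read off the hyperedges. The following sketch computes the maximum left degree (coordinates per term), the maximum right degree (terms per coordinate), and the maximum and average conflict degrees for a small hypothetical instance; the function name and the example are assumptions made for illustration.

```python
from itertools import combinations

def graph_degrees(hyperedges, d):
    """Return (Delta_L, Delta_R, max conflict degree, average conflict degree)."""
    sets = [set(e) for e in hyperedges]
    delta_left = max(len(s) for s in sets)                          # coordinates per term
    delta_right = max(sum(v in s for s in sets) for v in range(d))  # terms per coordinate
    conflict = [0] * len(sets)
    for i, k in combinations(range(len(sets)), 2):
        if sets[i] & sets[k]:
            conflict[i] += 1
            conflict[k] += 1
    return delta_left, delta_right, max(conflict), sum(conflict) / len(conflict)

edges = [(0, 1, 2), (2, 3), (3, 4), (0, 4), (1,)]
delta_l, delta_r, delta_c_max, delta_c_avg = graph_degrees(edges, d=5)
```

On this instance $\Delta_R \cdot \Delta_L^2 = 18 \ge \Delta_L \cdot \Delta_C = 9$, and the average conflict degree is smaller than the maximum one, which is the gap the comparison above exploits.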

4 Asynchronous Stochastic Coordinate Descent

In this section, we use the perturbed gradient framework to analyze the convergence of asynchronous
parallel stochastic coordinate descent (ASCD). This algorithm has been analyzed in
prior work. We show that the algorithm admits an elementary treatment in our perturbed iterate
framework, under the same assumptions made for Hogwild!.


Algorithm 2 ASCD
1: while iterations < T do in parallel
2:     $\hat{x}$ = an inconsistent read of the shared variable $x$
3:     Sample a coordinate $s$
4:     $u_s = -\gamma \cdot d\, [\nabla f(\hat{x})]_s$
5:     $[x]_s = [x]_s + u_s$    // atomic write
6: end while



ASCD, shown in Alg. 2, is a linearly convergent algorithm for minimizing strongly convex
functions $f$. At each iteration a core samples one of the coordinates, computes a full gradient
update for that coordinate, and proceeds with updating a single element of the shared memory
variable $x$. The challenge in analyzing ASCD, compared to Hogwild!, is that, in order to show
linear convergence, we need to show that the error due to the asynchrony between cores decays fast
when the iterates arrive close to the optimal solution. The perturbed iterate framework can handle
this type of noise analysis in a straightforward manner, using simple recursive bounds.
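
The update of Alg. 2 can be simulated serially by letting each step read an iterate that is up to $\tau$ steps old. The sketch below assumes a small quadratic $f(x) = \frac{1}{2} x^\top A x - b^\top x$ and models staleness with whole snapshots, which is a consistent-read special case of the inconsistent reads allowed by the algorithm. The matrix, the hand-picked step size, and the function name are illustrative assumptions.

```python
import random
from collections import deque

def ascd(A, b, gamma, steps, tau=0, seed=0):
    """Coordinate descent on 0.5 x^T A x - b^T x with reads up to tau steps stale."""
    rng = random.Random(seed)
    d = len(b)
    x = [0.0] * d
    history = deque([list(x)], maxlen=tau + 1)   # the last tau + 1 iterates
    for _ in range(steps):
        x_hat = rng.choice(history)              # line 2: stale read
        s = rng.randrange(d)                     # line 3: sample a coordinate
        grad_s = sum(A[s][v] * x_hat[v] for v in range(d)) - b[s]
        x[s] -= gamma * d * grad_s               # lines 4-5: single-coordinate write
        history.append(list(x))
    return x

d = 5
A = [[2.0 if i == k else 0.5 if abs(i - k) == 1 else 0.0 for k in range(d)]
     for i in range(d)]
b = [sum(row) for row in A]     # chosen so that the minimizer is the all-ones vector
x_serial = ascd(A, b, gamma=0.02, steps=5000, tau=0)
x_stale = ascd(A, b, gamma=0.02, steps=5000, tau=3)
serial_gap = max(abs(v - 1.0) for v in x_serial)
stale_gap = max(abs(v - 1.0) for v in x_stale)
```

Both runs converge to the minimizer at a linear rate in this example, since the coordinate gradients, and hence the asynchrony error, shrink as the iterates approach the optimum.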

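We define $\hat{x}_i$ as in the previous section, but now the samples $s_i$ are coordinates sampled
uniformly at random from $\{1, 2, \ldots, d\}$. After $T$ samples have been processed completely, the following
vector is contained in shared memory:

$$\underbrace{\overbrace{x_0 - \gamma d\, [\nabla f(\hat{x}_0)]_{s_0} e_{s_0}}^{x_1} - \cdots - \gamma d\, [\nabla f(\hat{x}_{T-1})]_{s_{T-1}} e_{s_{T-1}}}_{x_T},$$

where $x_0$ is the initial guess, $e_{s_j}$ is the standard basis vector with a one at position $s_j$, and $[\nabla f(x)]_{s_j}$
denotes the $s_j$-th coordinate of the gradient of $f$ computed at $x$. Similar to Hogwild! in the
previous section, ASCD satisfies the following iterative formula

$$x_{j+1} = x_j - \gamma \cdot d \cdot [\nabla f(\hat{x}_j)]_{s_j} e_{s_j} = x_j - \gamma \cdot g(\hat{x}_j, s_j).$$

Notice that $\mathbb{E}_{s_j} g(\hat{x}_j, s_j) = \nabla f(\hat{x}_j)$, and thus, similarly to Hogwild!, ASCD’s iterates $a_j = \mathbb{E}\|x_j -
x^*\|^2$ satisfy the recursion of Eq. (2.5):

$$a_{j+1} \le (1 - \gamma m)\, a_j + \gamma^2 \underbrace{\mathbb{E}\|g(\hat{x}_j, s_j)\|^2}_{R_0^j} + 2\gamma m \underbrace{\mathbb{E}\|\hat{x}_j - x_j\|^2}_{R_1^j} + 2\gamma \underbrace{\mathbb{E}\langle \hat{x}_j - x_j, g(\hat{x}_j, s_j) \rangle}_{R_2^j}.$$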

Before stating the main result of this section, let us introduce some further notation. Let
us define the largest distance between the optimal vector and the vector read by the cores
during the execution of the algorithm: $\hat{a}_0 := \max_{0 \le k \le T} \mathbb{E}\|\hat{x}_k - x^*\|^2$, a value which should be
thought of as proportional to $a_0 = \mathbb{E}\|x_0 - x^*\|^2$. Furthermore, by a simple application of the
$L$-Lipschitz assumption on $\nabla f$, we have a uniform bound on the norm of each computed gradient:
$M^2 := \max_{0 \le k \le T} \mathbb{E}\|\nabla f(\hat{x}_k)\|^2 \le L^2 \hat{a}_0$. Here we assume that the optimization takes place in an
$\ell_\infty$ ball, so that $M < \infty$. This simply means that the iterates will never have infinitely large coordinate
values. This assumption is made in previous work explicitly or implicitly, and in practice it can be
implemented easily since the projection on an $\ell_\infty$ ball can be done component-wise. Finally, let us
define the condition number of $f$ as $\kappa := L/m$, where $L$ is the Lipschitz constant, and $m$ the strong
convexity parameter.

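Theorem 5. Let the maximum number of coordinates that can be concurrently processed while a
core is processing a single coordinate be at most

$$\tau = O\left( \min\left\{ \frac{\kappa \sqrt{d}}{\log(\hat{a}_0 / \epsilon)}, \sqrt[6]{d} \right\} \right).$$

Then, ASCD with step-size $\gamma = \frac{O(1)}{d L \kappa}$ achieves an error $\mathbb{E}\|x_k - x^*\|^2 \le \epsilon$ after

$$k \ge O(1) \cdot d \kappa^2 \log\left( \frac{a_0}{\epsilon} \right)$$

iterations.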


Using the recursive inequality (2.5), serial SCD with the same step-size as in the above theorem
can be shown to achieve the same accuracy as ASCD in the same number of steps. Hence, as
long as the proxy for the number of cores is bounded as $\tau = O(\min\{\kappa \sqrt{d} \log(\hat{a}_0/\epsilon)^{-1}, \sqrt[6]{d}\})$, our
theorem implies a linear speedup with respect to this simple convergence bound. We would like
to note, however, that the coordinate descent literature sometimes uses more refined properties of
the function to be optimized that can lead to potentially better convergence bounds, especially in
terms of function value accuracy, i.e., $f(x_k) - f(x^*)$ (see, e.g., [5, 30]).

We would further like to remark that between the two bounds on $\tau$, the second one, i.e., $\tau = O(\sqrt[6]{d})$,
is the more restrictive, as the first one is proportional, up to log factors, to the square root of
the number of iterations, which is usually $\Omega(d)$. We explain in our subsequent derivation how this
loose bound can be improved, but leave the tightness of the bound as an open question for future
work.
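To make the serial baseline concrete, the following is a minimal sketch of serial SCD on a toy diagonal quadratic. The objective, the constant 1/2 inside the step size, and the helper names (`scd`, `sq_dist_to_opt`) are assumptions chosen for illustration; the text only fixes the update direction $g(x, s) = d\,[\nabla f(x)]_s e_s$ and a step size of order $1/(dL\kappa)$.

```python
import random

def scd(h, x0, gamma, num_iters, seed=0):
    """Serial SCD on the toy objective f(x) = 0.5 * sum(h[i] * x[i]**2), whose optimum is x* = 0."""
    rng = random.Random(seed)
    d = len(h)
    x = list(x0)
    for _ in range(num_iters):
        s = rng.randrange(d)            # coordinate sampled uniformly at random
        grad_s = h[s] * x[s]            # s-th partial derivative of f at x
        x[s] -= gamma * d * grad_s      # step along g(x, s) = d * [grad f(x)]_s * e_s
    return x

def sq_dist_to_opt(x):
    """Squared distance to the optimum x* = 0."""
    return sum(v * v for v in x)

h = [1.0, 2.0, 4.0, 8.0]                # curvatures: m = min(h), L = max(h)
m, L, d = min(h), max(h), len(h)
gamma = m / (2 * d * L ** 2)            # assumed instance of gamma = O(1) / (d * L * kappa)
x0 = [1.0] * d
x = scd(h, x0, gamma, num_iters=20000)
```

With this step size every update shrinks the sampled coordinate by a factor in $[1/2, 1)$, so the squared distance to the optimum decreases monotonically along the run.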


4.1 Proof of Theorem 5

The analysis here is slightly more involved compared to Hogwild!. The main technical bottleneck
is to relate the decay of $R_0^j$ with that of $R_1^j$, and then to exploit the sparsity of the updates for
bounding $R_2^j$.

We start with a simple upper bound on the norm of the gradient updates. From the $L$-Lipschitz
assumption on $\nabla f(x)$, we have

$$\mathbb{E}_{s_k}\|g(\hat{x}_k, s_k)\|^2 = d \cdot \|\nabla f(\hat{x}_k)\|^2 \le dL^2\|\hat{x}_k - x^*\|^2 \le 2dL^2\|x_j - x^*\|^2 + 2dL^2\|x_j - \hat{x}_k\|^2,$$

where the last inequality is due to Jensen's inequality. This yields the following result.

Lemma 6. For any $k$ and $j$ we have $\mathbb{E}\|g(\hat{x}_k, s_k)\|^2 \le 2dL^2\left(a_j + \mathbb{E}\|x_j - \hat{x}_k\|^2\right)$.

Let $T$ be the total number of ASCD iterations, and let us define the set

$$\mathcal{S}_r^j = \{\max\{j - r\tau, 0\}, \ldots, j - 1, j, j + 1, \ldots, \min\{j + r\tau, T\}\},$$

which has cardinality at most $2r\tau + 1$ and contains all indices around $j$ within $r\tau$ steps, as sketched
in Fig. 2. Due to Assumption 3 and similar to [21], there exist variables $\sigma_{i,k}^j \in \{-1, 0, 1\}$ such
that, for any index $k$ in the set $\mathcal{S}_r^j$, we have

$$\hat{x}_k - x_j = \sum_{i \in \mathcal{S}_{r+1}^j} \sigma_{i,k}^j \gamma\, g(\hat{x}_i, s_i). \tag{4.1}$$

Figure 2: The set $\mathcal{S}_r^j = \{\max\{j - r\tau, 0\}, \ldots, j - 1, j, j + 1, \ldots, \min\{j + r\tau, T\}\}$ comprises the indices around
$j$ (including $j$) within $r\tau$ steps. The cardinality of such a set is at most $2r\tau + 1$. Here, $\mathcal{S}_0^j = \{j\}$.

The above equation implies that the difference between a "fake" iterate at time $j$ and the value
that was read at time $k$ can be expressed as a linear combination of any coordinate updates that
occurred during the time interval defined by $\mathcal{S}_{r+1}^j$.
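As a small illustration of the index set defined above, the sketch below enumerates the window of indices within $r\tau$ steps of $j$, clipped to $[0, T]$. The function name `index_window` is hypothetical and not part of the original text.

```python
def index_window(j, r, tau, T):
    """Return the indices {max(j - r*tau, 0), ..., j, ..., min(j + r*tau, T)} as a list."""
    lo = max(j - r * tau, 0)
    hi = min(j + r * tau, T)
    return list(range(lo, hi + 1))

# Away from the boundaries the window has exactly 2 * r * tau + 1 elements;
# near 0 or T it is clipped and therefore smaller.
interior = index_window(j=10, r=2, tau=3, T=100)
clipped = index_window(j=2, r=1, tau=3, T=100)
```

For `r = 0` the window reduces to the single index `j`, matching the convention stated in the caption of Figure 2.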

From Eq. (4.1) we see that $\|\hat{x}_k - x_j\|$, for any $k \in \mathcal{S}_r^j$, can be upper bounded in terms of the
magnitude of the coordinate updates that occur in $\mathcal{S}_{r+1}^j$. Since these updates are coordinates of
the true gradient, we can use their norm to bound the size of $\hat{x}_k - x_j$. This will be useful towards
bounding $R_1^j$. Moreover, Lemma 6 shows that the magnitude of the gradient steps can be upper
bounded in terms of the size of the mismatches. This will in turn be useful in bounding $R_0^j$. The
above observations are fundamental to our approach. The following lemma makes the above ideas
explicit.

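Lemma 7. For any $j \in \{0, \ldots, T\}$, we have

$$\max_{k \in \mathcal{S}_r^j} \mathbb{E}\|g(\hat{x}_k, s_k)\|^2 \le 2dL^2\left(a_j + \max_{k \in \mathcal{S}_r^j} \mathbb{E}\|\hat{x}_k - x_j\|^2\right), \tag{4.2}$$

$$\max_{k \in \mathcal{S}_r^j} \mathbb{E}\|\hat{x}_k - x_j\|^2 \le (3\gamma\tau(r + 1))^2 \max_{k \in \mathcal{S}_{r+1}^j} \mathbb{E}\|g(\hat{x}_k, s_k)\|^2. \tag{4.3}$$

Proof. The first inequality is a consequence of Lemma 6. For the second, as mentioned previously,
we have $\hat{x}_k - x_j = \sum_{i \in \mathcal{S}_{r+1}^j} \sigma_{i,k}^j \gamma\, g(\hat{x}_i, s_i)$ when $k \in \mathcal{S}_r^j$. Hence,

$$\mathbb{E}\|\hat{x}_k - x_j\|^2 = \gamma^2 \cdot \mathbb{E}\Big\|\sum_{i \in \mathcal{S}_{r+1}^j} \sigma_{i,k}^j\, g(\hat{x}_i, s_i)\Big\|^2 \le \gamma^2 |\mathcal{S}_{r+1}^j|^2 \max_{i \in \mathcal{S}_{r+1}^j} \mathbb{E}\|g(\hat{x}_i, s_i)\|^2 \le (3\gamma\tau(r + 1))^2 \max_{i \in \mathcal{S}_{r+1}^j} \mathbb{E}\|g(\hat{x}_i, s_i)\|^2,$$

where the first inequality follows due to Jensen's inequality, and the last inequality uses the bound
$|\mathcal{S}_{r+1}^j| \le 2(r + 1)\tau + 1 \le 3\tau(r + 1)$. □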

Remark 3. The $\tau^2$ factor in the upper bound on $\max \mathbb{E}\|\hat{x}_k - x_j\|^2$ in Lemma 7 might be loose.
We believe that it should instead be $\tau$, when $\tau$ is smaller than some measure of the sparsity. If the
sparsity of the steps $g(\hat{x}_j, s_j)$ can be exploited, we suspect that the condition $\tau = O(\sqrt[6]{d})$ in Theorem
5 could be relaxed.

Let us now define for simplicity $G_r = \max_{k \in \mathcal{S}_r^j} \mathbb{E}\|g(\hat{x}_k, s_k)\|^2$ and $\Delta_r = \max_{k \in \mathcal{S}_r^j} \mathbb{E}\|\hat{x}_k - x_j\|^2$.

Observe that all gradient norms can be bounded as

$$G_r = \max_{k \in \mathcal{S}_r^j} \mathbb{E}\|g(\hat{x}_k, s_k)\|^2 = d \max_{k \in \mathcal{S}_r^j} \mathbb{E}\|\nabla f(\hat{x}_k)\|^2 \le d \max_{0 \le k \le T} \mathbb{E}\|\nabla f(\hat{x}_k)\|^2 = dM^2,$$

a property that we will use in our bounds. Observe that $R_0^j = \mathbb{E}\|g(\hat{x}_j, s_j)\|^2 = G_0$ and $R_1^j =
\mathbb{E}\|\hat{x}_j - x_j\|^2 = \Delta_0$. To obtain bounds for our first two error terms, $R_0^j$ and $R_1^j$, we will expand the
recursive relations that are implied by Lemma 7. As shown in § B.1.1 of the Appendix, we obtain
the following bounds.

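Lemma 8. Let $\tau \le \kappa\sqrt{d}/\ell$ and set $\gamma = O(1)\frac{\theta}{dL\kappa}$, for any $\theta \le 1$ and $\ell \ge 1$. Then,

$$R_0^j \le O(1)\left(dL^2 a_j + \theta^{2\ell} dM^2\right) \quad \text{and} \quad R_1^j \le O(1)\left(\theta^2 a_j + \theta^{2\ell}\frac{M^2}{L^2}\right).$$

The Cauchy-Schwarz inequality implies the bound $R_2^j \le \sqrt{R_0^j R_1^j}$. Unfortunately this approach
yields a result that can only guarantee convergence up to a factor of $\sqrt{d}$ slower than serial SCD.
This happens because upper bounding the inner product $\langle \hat{x}_j - x_j, g(\hat{x}_j, s_j)\rangle$ by $\|\hat{x}_j - x_j\|\,\|g(\hat{x}_j, s_j)\|$
disregards the extreme sparsity of $g(\hat{x}_j, s_j)$. The next lemma uses a slightly more involved argument
to bound $R_2^j$ exploiting the sparsity of the gradient update. The proof can be found in Appendix B.1.2.

Lemma 9. Let $\tau \le \kappa\sqrt{d}/\ell$ and $\tau = O(\sqrt[6]{d})$. Then, $R_2^j \le O(1)\left(\theta m a_j + \theta^{2\ell}\frac{M^2}{L\kappa}\right)$.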


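4.1.1 Putting it all together

We can now plug in the upper bounds on $R_0^j$, $R_1^j$, and $R_2^j$ in our perturbed iterate recursive formula

$$a_{j+1} \le (1 - \gamma m)a_j + \gamma^2 \underbrace{\mathbb{E}\|g(\hat{x}_j, s_j)\|^2}_{R_0^j} + 2\gamma m \underbrace{\mathbb{E}\|\hat{x}_j - x_j\|^2}_{R_1^j} + 2\gamma \underbrace{\mathbb{E}\langle \hat{x}_j - x_j, g(\hat{x}_j, s_j)\rangle}_{R_2^j},$$

to find that ASCD satisfies

$$a_{j+1} \le \Big(1 - \gamma m + \underbrace{O(1)\left(\gamma^2 dL^2 + \gamma m\theta^2 + \gamma\theta m\right)}_{= r(\gamma)}\Big)a_j + \underbrace{O(1)\left(\gamma^2\theta^{2\ell} dM^2 + \gamma\theta^{2\ell}\frac{M^2}{L\kappa}\right)}_{= \delta(\gamma)}.$$

Observe that in the serial case of SCD the errors $R_1^j$ and $R_2^j$ are zero, and $R_0^j = \mathbb{E}\|g(x_j, s_j)\|^2$.
By applying the Lipschitz assumption on $\nabla f$, we get $\mathbb{E}\|g(x_j, s_j)\|^2 \le dL^2 a_j$, and obtain the simple
recursive formula

$$a_{j+1} \le (1 - \gamma m + \gamma^2 dL^2)a_j. \tag{4.4}$$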

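To guarantee that ASCD follows the same recursion, i.e., it has the same convergence rate as the
one implied by Eq. (4.4), we require that $\gamma m - r(\gamma) \ge C(\gamma m - \gamma^2 dL^2)$, where $C < 1$ is a constant.
Writing $r(\gamma) = C'(\gamma^2 dL^2 + \gamma m\theta^2 + \gamma\theta m)$ and solving for $\gamma$ we get

$$\gamma m - C'\left(\gamma^2 dL^2 + \gamma m\theta^2 + \gamma\theta m\right) \ge C(\gamma m - \gamma^2 dL^2)$$

$$\Leftrightarrow\ (1 - C)\gamma m - (C' - C)\gamma^2 dL^2 - C'(\gamma m\theta^2 + \gamma\theta m) \ge 0$$

$$\Leftrightarrow\ (C' - C)\gamma dL^2 \le \left[(1 - C) - C'(\theta^2 + \theta)\right]m\ \Leftrightarrow\ \gamma \le \frac{(1 - C) - C'(\theta^2 + \theta)}{C' - C} \cdot \frac{m}{dL^2},$$

where $C' > 1$ is some absolute constant. For $\gamma = O(1)\frac{m}{dL^2}$, the $\delta(\gamma)$ term in the recursive bound
becomes

$$\delta(\gamma) = O(1)\left(\gamma^2\theta^{2\ell} dM^2 + \gamma\theta^{2\ell}\frac{M^2}{L\kappa}\right) \le O(1)\,\theta^{2\ell}\frac{\bar{a}_0}{d\kappa^2},$$

where we used the inequality $M^2 \le L^2\bar{a}_0$. Hence, ASCD satisfies

$$a_{j+1} \le \left(1 - \frac{O(1)}{d\kappa^2}\right)a_j + O(1)\,\theta^{2\ell}\frac{\bar{a}_0}{d\kappa^2} \le \left(1 - \frac{O(1)}{d\kappa^2}\right)^{j+1} a_0 + O(1)\,\theta^{2\ell}\bar{a}_0.$$

Let us set $\theta$ to be a sufficiently small constant, and solve for $\ell$ such that
$O(1)\theta^{2\ell}\bar{a}_0 = \epsilon/2$. This gives $\ell = O(1)\log(\bar{a}_0/\epsilon)$. Our main theorem for ASCD now follows from
solving $\left(1 - \frac{O(1)}{d\kappa^2}\right)^{j+1} a_0 = \epsilon/2$ for $j$.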
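The last step above, solving the geometric decay for $j$, can be sketched numerically. The function below is illustrative only: the name `iterations_to_accuracy` and the choice `c = 1.0` for the unspecified $O(1)$ constant are assumptions, not values from the paper.

```python
import math

def iterations_to_accuracy(d, kappa, a0, eps, c=1.0):
    """Smallest j with (1 - c / (d * kappa**2))**j * a0 <= eps / 2."""
    rate = 1.0 - c / (d * kappa ** 2)
    return math.ceil(math.log(eps / (2.0 * a0)) / math.log(rate))

d, kappa, a0, eps = 10, 2.0, 1.0, 1e-3
j = iterations_to_accuracy(d, kappa, a0, eps)
# Since -log(1 - x) >= x, j never exceeds d * kappa^2 * log(2 * a0 / eps) rounded up,
# which is the d * kappa^2 * log(a0 / eps) scaling of the theorem.
upper = math.ceil(d * kappa ** 2 * math.log(2.0 * a0 / eps))
```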


5 Sparse and Asynchronous SVRG 

The SVRG algorithm, presented in [11], is a variance-reduction approach to stochastic gradient
descent with strong theoretical guarantees and empirical performance. In this section, we present a
parallel, asynchronous and sparse variant of SVRG. We also present a convergence analysis, showing
that the analysis proceeds in a nearly identical way to that of ASCD.

5.1 Serial Sparse SVRG

The original SVRG algorithm of [11] runs for a number of epochs; the per-epoch iteration is given
as follows:

$$x_{j+1} = x_j - \gamma\left(g(x_j, s_j) - g(y, s_j) + \nabla f(y)\right), \tag{5.1}$$

where $y$ is the last iterate of the previous epoch, and as such is updated at the end of every epoch.
Here $f$ is of the same form as in (3.1):

$$f(x) = \frac{1}{n}\sum_{i=1}^{n} f_{e_i}(x),$$

and $g(x, s_j) = \nabla f_{s_j}(x)$, with hyperedges $s_j \in \mathcal{E}$ sampled uniformly at random. As is common in the
SVRG literature, we further assume that the individual $f_{e_i}$ terms are $L$-smooth. The theoretical
innovation in SVRG is having an SGM-flavored algorithm, with small amortized cost per iteration,
where the variance of the gradient estimate is smaller than that of standard SGM. For a certain
selection of learning rate, epoch size, and number of iterations, [11] establishes that SVRG attains
a linear rate.

Observe that when optimizing a decomposable $f$ with sparse terms, in contrast to SGM, the
SVRG iterates will be dense due to the term $\nabla f(y)$. From a practical perspective, when the SGM
iterates are sparse (the case in several applications [1]), the cost of writing a sparse update in shared
memory is significantly smaller than applying the dense gradient update term $\nabla f(y)$. Furthermore,
these dense updates will cause significantly more memory conflicts in an asynchronous execution,
amplifying the error terms in (2.5), and introducing time delays due to memory contention.

A sparse version of SVRG can be obtained by letting the support of the update be determined
by that of $g(y, s_j)$:

$$x_{j+1} = x_j - \gamma\left(g(x_j, s_j) - g(y, s_j) + D_{s_j}\nabla f(y)\right) =: x_j - \gamma v_j, \tag{5.2}$$

where $D_{s_j} = P_{s_j} D$, and $P_{s_j}$ is the projection on the support of $s_j$ and $D = \mathrm{diag}(p_1^{-1}, \ldots, p_d^{-1})$ is a
$d \times d$ diagonal matrix. The weight $p_v$ is equal to the probability that index $v$ belongs to a hyperedge
sampled uniformly at random from $\mathcal{E}$. These probabilities can be computed from the right degrees
of the bipartite graph shown in Fig. 1. The normalization ensures that $\mathbb{E}_{s_j} D_{s_j}\nabla f(y) = \nabla f(y)$ and
thus that $\mathbb{E} v_j = \nabla f(x_j)$. We will establish the same upper bound on $\mathbb{E}\|v_j\|^2$ for sparse SVRG as
the one used in [11] to establish a linear rate of convergence for dense SVRG. As before we assume
that there exists a uniform bound $M > 0$ such that $\|v_j\| \le M$.
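The unbiasedness of the rescaled sparse term can be checked on a toy example. The sketch below uses exact rational arithmetic; the hyperedge set, the vector `z` standing in for $\nabla f(y)$, and the helper names are hypothetical, and it assumes every coordinate belongs to at least one hyperedge so that each $p_v > 0$.

```python
from fractions import Fraction

def inclusion_probabilities(hyperedges, d):
    """p[v] = probability that coordinate v lies in a hyperedge drawn uniformly at random."""
    n = len(hyperedges)
    return [Fraction(sum(1 for s in hyperedges if v in s), n) for v in range(d)]

def sparse_full_gradient(z, s, p):
    """D_s z: keep only the coordinates in s and rescale coordinate v by 1 / p[v]."""
    return [z[v] / p[v] if v in s else Fraction(0) for v in range(len(z))]

hyperedges = [{0, 1}, {1, 2}, {1, 3}, {0, 3}]      # toy hyperedge set (assumption)
d = 4
p = inclusion_probabilities(hyperedges, d)
z = [Fraction(3), Fraction(-1), Fraction(5), Fraction(2)]   # stands in for grad f(y)

# Average of D_s z over all hyperedges, i.e. its expectation under uniform sampling.
mean = [sum(sparse_full_gradient(z, s, p)[v] for s in hyperedges) / len(hyperedges)
        for v in range(d)]
```

Averaging over all hyperedges recovers `z` exactly, which is the property that makes the sparse update an unbiased estimate of the dense one.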

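Lemma 10. The variance of the serial sparse SVRG procedure in (5.2) satisfies

$$\mathbb{E}\|v_j\|^2 \le 2\mathbb{E}\|g(x_j, s_j) - g(x^*, s_j)\|^2 + 2\mathbb{E}\|g(y, s_j) - g(x^*, s_j)\|^2 - 2\nabla f(y)^T D \nabla f(y).$$

Proof. By definition $v_j = g(x_j, s_j) - g(y, s_j) + D_{s_j}\nabla f(y)$. Therefore

$$\mathbb{E}\|v_j\|^2 = \mathbb{E}\|g(x_j, s_j) - g(y, s_j) + D_{s_j}\nabla f(y)\|^2 \le 2\mathbb{E}\|g(x_j, s_j) - g(x^*, s_j)\|^2 + 2\mathbb{E}\|g(y, s_j) - g(x^*, s_j) - D_{s_j}\nabla f(y)\|^2.$$

We expand the second term to find that

$$\mathbb{E}\|g(y, s_j) - g(x^*, s_j) - D_{s_j}\nabla f(y)\|^2 = \mathbb{E}\|g(y, s_j) - g(x^*, s_j)\|^2 - 2\mathbb{E}\langle g(y, s_j) - g(x^*, s_j), D_{s_j}\nabla f(y)\rangle + \mathbb{E}\|D_{s_j}\nabla f(y)\|^2.$$

Since $g(x, s_j)$ is supported on $s_j$ for all $x$, we have

$$\mathbb{E}\langle g(y, s_j) - g(x^*, s_j), D_{s_j}\nabla f(y)\rangle = \mathbb{E}\langle g(y, s_j) - g(x^*, s_j), D\nabla f(y)\rangle = \nabla f(y)^T D \nabla f(y),$$

where the second equality follows by the property of iterated expectations. The conclusion follows
because $\mathbb{E}\|D_{s_j}\nabla f(y)\|^2 = \nabla f(y)^T D \nabla f(y)$. □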

Observe that the last term in the variance bound is a non-negative quadratic form, hence we
can drop it and obtain the same variance bound as the one obtained in [11] for dense SVRG. This
directly leads to the following corollary.

Corollary 11. Sparse SVRG admits the same convergence rate upper bound as that of the SVRG
of [11].

We note that usually the convergence rates for SVRG are obtained for function value differences.
However, since our perturbed iterate framework of § 2 is based on iterate differences, we re-derive
a convergence bound for iterates.

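Lemma 12. Let the step size be $\gamma = \frac{1}{4L\kappa}$ and the length of an epoch be $8\kappa^2$. Then, $\mathbb{E}\|y_k - x^*\|^2 \le
0.75^k \cdot \mathbb{E}\|y_0 - x^*\|^2$, where $y_k$ is the iterate at the end of the $k$-th epoch.

Proof. We bound the distance to the optimum after one epoch of length $8\kappa^2$:

$$\begin{aligned}
\mathbb{E}\|x_{j+1} - x^*\|^2 &= \mathbb{E}\|x_j - x^*\|^2 - 2\gamma\mathbb{E}\langle x_j - x^*, v_j\rangle + \gamma^2\mathbb{E}\|v_j\|^2 \\
&\le \mathbb{E}\|x_j - x^*\|^2 - 2\gamma\mathbb{E}\langle x_j - x^*, \nabla f(x_j)\rangle + 2\gamma^2\mathbb{E}\|g(x_j, s_j) - g(x^*, s_j)\|^2 + 2\gamma^2\mathbb{E}\|g(y, s_j) - g(x^*, s_j)\|^2 \\
&\le \mathbb{E}\|x_j - x^*\|^2 - 2\gamma\mathbb{E}\langle x_j - x^*, \nabla f(x_j)\rangle + 2\gamma^2L^2\mathbb{E}\|x_j - x^*\|^2 + 2\gamma^2L^2\mathbb{E}\|y - x^*\|^2 \\
&\le (1 - 2\gamma m + 2\gamma^2L^2)\mathbb{E}\|x_j - x^*\|^2 + 2\gamma^2L^2\mathbb{E}\|y - x^*\|^2.
\end{aligned}$$

The first inequality follows from Lemma 10 and an application of iterated expectations to obtain
$\mathbb{E}\langle x_j - x^*, v_j\rangle = \mathbb{E}\langle x_j - x^*, \nabla f(x_j)\rangle$. The second inequality follows from the smoothness of $g(x, s_j)$,
and the third inequality follows since $f$ is $m$-strongly convex.

We can rewrite the inequality as $a_{j+1} \le (1 - 2\gamma m + 2\gamma^2L^2)a_j + 2\gamma^2L^2a_0$, because by construction
$y = x_0$. Let $\gamma = \frac{1}{4L\kappa}$. Then, $1 - 2\gamma m + 2\gamma^2L^2 \le 1 - \frac{1}{4\kappa^2}$ and

$$\sum_{i=0}^{j}(1 - 2\gamma m + 2\gamma^2L^2)^i \le \sum_{i=0}^{j}\left(1 - \frac{1}{4\kappa^2}\right)^i \le \sum_{i=0}^{\infty}\left(1 - \frac{1}{4\kappa^2}\right)^i = 4\kappa^2.$$

Therefore

$$a_{j+1} \le \left(1 - \frac{1}{4\kappa^2}\right)a_j + 2\gamma^2L^2a_0 \le \left(1 - \frac{1}{4\kappa^2}\right)^{j+1}a_0 + \sum_{i=0}^{j}\left(1 - \frac{1}{4\kappa^2}\right)^i 2\gamma^2L^2a_0 \le \left(1 - \frac{1}{4\kappa^2}\right)^{j+1}a_0 + 4\kappa^2 \cdot 2\gamma^2L^2a_0 = \left[\left(1 - \frac{1}{4\kappa^2}\right)^{j+1} + \frac{1}{2}\right]a_0.$$

Setting the length of an epoch to be $j = 2 \cdot (4\kappa^2)$ gives us $a_{j+1} \le (1/2 + 1/4) \cdot a_0 = 0.75 \cdot a_0$, and
the conclusion follows. □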
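The per-epoch contraction can be sanity-checked by iterating the upper-bound recursion $a_{j+1} \le (1 - 2\gamma m + 2\gamma^2L^2)a_j + 2\gamma^2L^2a_0$ with equality. This is a sketch under the assumption that the step size is $\gamma = 1/(4L\kappa)$ and the epoch length is $8\kappa^2$, the values consistent with the proof's constants; the function name is hypothetical.

```python
def epoch_contraction(L, m):
    """Iterate the worst-case recursion for one epoch, starting from a_0 = 1."""
    kappa = L / m
    gamma = 1.0 / (4.0 * L * kappa)                        # assumed step size
    epoch_len = int(8 * kappa ** 2)                        # assumed epoch length
    rho = 1.0 - 2.0 * gamma * m + 2.0 * gamma ** 2 * L ** 2   # per-step factor on a_j
    drift = 2.0 * gamma ** 2 * L ** 2                      # per-step additive term, times a_0 = 1
    a = 1.0
    for _ in range(epoch_len):
        a = rho * a + drift
    return a

ratios = [epoch_contraction(float(k), 1.0) for k in (1, 2, 10)]
```

Each returned value is the ratio $a_{\text{end}}/a_0$ after one epoch; all of them stay below the 0.75 factor used in the lemma.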


We thus obtain the following convergence rate result:

Theorem 13. Sparse SVRG, with step size $\gamma = O(1)\frac{1}{L\kappa}$ and epoch size $S = O(1)\kappa^2$, reaches
accuracy $\mathbb{E}\|y_E - x^*\|^2 \le \epsilon$ after $E = O(1)\log(a_0/\epsilon)$ epochs, where $y_E$ is the last iterate of the final
epoch, and $a_0 = \|x_0 - x^*\|^2$ is the initial distance squared to the optimum.


5.2 KroMagnon: Asynchronous Parallel Sparse SVRG 

We now present an asynchronous implementation of sparse SVRG. This implementation, which we
refer to as KroMagnon, is given in Algorithm 3.

Algorithm 3 KroMagnon

x = y = x_0
for Epoch = 1 : E do
    Compute in parallel z = ∇f(y)
    while number of sampled hyperedges < S do in parallel
        sample a random hyperedge s
        [x̂]_s = an inconsistent read of the shared variable [x]_s
        [u]_s = −γ · (∇f_s([x̂]_s) − ∇f_s([y]_s) + D_s z)
        for v ∈ s do
            [x]_v = [x]_v + [u]_v    // atomic write
        end for
    end while
    y = x
end for
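The update rule of Algorithm 3 can be sketched in a single thread as follows. This is an illustrative sketch only, not the authors' Scala implementation: it runs serially, so every read is consistent and no asynchrony is simulated; the sparse least-squares terms $f_i(x) = \frac{1}{2}(\langle a_i, x\rangle - b_i)^2$, the toy data, the step size, and all function names are assumptions.

```python
import random

def grad_term(a, b, x):
    """Gradient of f_i(x) = 0.5 * (<a, x> - b)^2; a is sparse, given as {coordinate: value}."""
    residual = sum(val * x[v] for v, val in a.items()) - b
    return {v: residual * val for v, val in a.items()}

def full_gradient(rows, targets, y):
    """Dense gradient of f(y) = (1/n) * sum_i f_i(y)."""
    z = [0.0] * len(y)
    for a, b in zip(rows, targets):
        for v, g in grad_term(a, b, y).items():
            z[v] += g / len(rows)
    return z

def sparse_svrg(rows, targets, x0, gamma, epochs, epoch_size, seed=0):
    rng = random.Random(seed)
    n, d = len(rows), len(x0)
    # p[v]: probability that coordinate v belongs to a uniformly sampled hyperedge
    p = [sum(1 for a in rows if v in a) / n for v in range(d)]
    x, y = list(x0), list(x0)
    for _ in range(epochs):
        z = full_gradient(rows, targets, y)        # full gradient at the epoch anchor y
        for _ in range(epoch_size):
            i = rng.randrange(n)                   # sample a hyperedge (a data term)
            a, b = rows[i], targets[i]
            gx = grad_term(a, b, x)                # uses x only on the support of a
            gy = grad_term(a, b, y)
            for v in a:                            # write only the coordinates in the support
                x[v] -= gamma * (gx[v] - gy[v] + z[v] / p[v])
        y = list(x)                                # anchor for the next epoch
    return y

d = 4
x_star = [1.0, -2.0, 3.0, 0.5]                     # toy ground truth (assumption)
rows = [{v: 1.0, (v + 1) % d: 0.5} for v in range(d)]
targets = [sum(val * x_star[v] for v, val in a.items()) for a in rows]
y = sparse_svrg(rows, targets, [0.0] * d, gamma=0.02, epochs=20, epoch_size=2000)
err = sum((yi - xi) ** 2 for yi, xi in zip(y, x_star))
```

Each inner step touches only the coordinates in the sampled support, which is the property that keeps memory conflicts rare when the loop is run by several threads.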



Let $v(\hat{x}_j, s_j) = g(\hat{x}_j, s_j) - g(y, s_j) + D_{s_j}\nabla f(y)$ be the noisy gradient update vector. Then, after
processing a total of $T$ hyperedges, the shared memory contains:

$$x_T = \underbrace{x_0 - \gamma v(\hat{x}_0, s_0)}_{x_1} - \ldots - \gamma v(\hat{x}_{T-1}, s_{T-1}). \tag{5.3}$$

We now define the perturbed iterates as $x_{i+1} = x_i - \gamma v(\hat{x}_i, s_i)$ for $i = 0, 1, \ldots, T - 1$, where $s_i$
is the $i$-th uniformly sampled hyperedge. Since $\mathbb{E}v(x_j, s_j) = \nabla f(x_j)$, KroMagnon also satisfies
recursion (2.5):

$$a_{j+1} \le (1 - \gamma m)a_j + \gamma^2 \underbrace{\mathbb{E}\|v(\hat{x}_j, s_j)\|^2}_{R_0^j} + 2\gamma m \underbrace{\mathbb{E}\|\hat{x}_j - x_j\|^2}_{R_1^j} + 2\gamma \underbrace{\mathbb{E}\langle \hat{x}_j - x_j, v(\hat{x}_j, s_j)\rangle}_{R_2^j}.$$

To prove the convergence of KroMagnon we follow the line of reasoning presented in the previous 
section. Most of the arguments used here come from a straightforward generalization of the analysis 
of ASCD. The main result of this section is given below. 


Theorem 14. Let the maximum number of samples that can overlap in time with a single sample 
be bounded as 



Then, KroMagnon, with step size $\gamma = O(1)\frac{1}{L\kappa}$ and epoch size $S = O(1)\kappa^2$, attains $\mathbb{E}\|y_E - x^*\|^2 \le
\epsilon$ after $E = O(1)\log(a_0/\epsilon)$ epochs, where $y_E$ is the last iterate of the final epoch, and $a_0 = \|x_0 - x^*\|^2$
is the initial distance squared to the optimum.


We would like to note that the total number of iterations in the above bound is, up to a
universal constant, the same as that of serial sparse SVRG as presented in Theorem 13. Again,
as with Hogwild! and ASCD, this implies a linear speedup.

Similar to our ASCD analysis, we remark that between the two bounds on $\tau$, the second one
is the more restrictive. The first one is, up to logarithmic factors, equal to the square root of the
total number of iterations per epoch; we expect that the size of the epoch is proportional to $n$, the
number of function terms (or data points). This suggests that the first bound is proportional to
$O(\sqrt{n})$ for most reasonable applications. Moreover, the second bound is certainly loose; we argue
that it can be tightened using a more refined analysis.


5.3 Proof of Theorem 14

It is easy to see that due to Lemma 10 we get the following bound on the norm of the gradient
estimate.

Lemma 15. For any $k$ and $j$ we have

$$\mathbb{E}\|v(\hat{x}_k, s_k)\|^2 \le 4L^2\left(a_j + a_0 + \mathbb{E}\|x_j - \hat{x}_k\|^2\right). \tag{5.4}$$

Proof. Due to Lemma 10 we have $\mathbb{E}\|v(\hat{x}_k, s_k)\|^2 \le 2L^2\mathbb{E}\|\hat{x}_k - x^*\|^2 + 2L^2\mathbb{E}\|y - x^*\|^2$. Then, using
the fact that $y = x_0$ and applying the triangle inequality, we obtain the result. □

The set $\mathcal{S}_r^j$ is defined as in the previous section: $\mathcal{S}_r^j = \{\max\{j - r\tau, 0\}, \ldots, j - 1, j, j +
1, \ldots, \min\{j + r\tau, T\}\}$, and has cardinality at most $2r\tau + 1$. By Assumption 3 there exist
diagonal sign matrices $\mathbf{S}_i^j$ with diagonal entries in $\{-1, 0, 1\}$ such that

$$\hat{x}_k - x_j = \gamma \sum_{i \in \mathcal{S}_{r+1}^j} \mathbf{S}_i^j v(\hat{x}_i, s_i). \tag{5.5}$$

This leads to the following lemma.

Lemma 16. If $G_r = \max_{k \in \mathcal{S}_r^j} \mathbb{E}\|v(\hat{x}_k, s_k)\|^2$ and $\Delta_r = \max_{k \in \mathcal{S}_r^j} \mathbb{E}\|\hat{x}_k - x_j\|^2$, then

$$G_r \le 4L^2\left(a_j + a_0 + \Delta_r\right) \quad \text{and} \quad \Delta_r \le (3\gamma\tau(r + 1))^2 G_{r+1}. \tag{5.6}$$

Proof. The proof for the bound on $\Delta_r$ is identical to the proof of Lemma 7. We then use Lemma 15
to bound $\mathbb{E}\|v(\hat{x}_k, s_k)\|^2$. □


As explained in the remark after Lemma 7, it should be possible to improve $\tau^2$ to $\tau$ in the
upper bound on $\Delta_r$. Doing so would relax the second condition on $\tau$ in Theorem 14.
One possible approach to this problem can be found in § B.2.2 of the Appendix.

We can now obtain bounds on the errors due to asynchrony. The proofs for the following two
lemmas can be found in Appendix B.2.

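Lemma 17. Suppose $\tau \le \kappa/\ell$ and $\gamma = O(1)\frac{\theta}{L\kappa}$. Then the error terms $R_0^j$ and $R_1^j$ of KroMagnon
satisfy the following inequalities:

$$R_0^j \le O(1)\left(L^2(a_j + a_0) + \theta^{2\ell}M^2\right) \quad \text{and} \quad R_1^j \le O(1)\left(\theta^2(a_j + a_0) + \theta^{2\ell}\frac{M^2}{L^2}\right).$$

Similarly to the ASCD derivations, we obtain the following bound for $R_2^j$.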


Lemma 18. Suppose t < j and t = O 




R 2 — ^( 1 ) [ d ■ m ■ {uj + ao) + d 


2 £ 


Lk 


5.3.1 Putting it all together 

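After plugging in the upper bounds on $R_0^j$, $R_1^j$, and $R_2^j$ in the main recursion satisfied by KroMagnon,
we find that:

$$a_{j+1} \le \left(1 - \gamma m + O(1)\left(\gamma^2L^2 + \gamma m\theta^2 + \gamma\theta m\right)\right)a_j + O(1)\left(\gamma^2L^2 + \gamma m\theta^2 + \gamma\theta m\right)a_0 + O(1)\gamma^2\theta^{2\ell}M^2 + \gamma\,O(1)\theta^{2\ell}\frac{M^2}{L\kappa}.$$

If we set $\gamma = O(1)\frac{\theta}{L\kappa}$, i.e., the same step size as serial sparse SVRG (Theorem 13) up to the constant $\theta$, then the above
becomes

$$a_{j+1} \le \left(1 - \frac{O(1)}{\kappa^2}\right)a_j + \frac{O(1)\theta}{\kappa^2}a_0 + \theta^{2\ell}O(1)\frac{M^2}{L^2\kappa^2} \le \left[\left(1 - \frac{O(1)}{\kappa^2}\right)^{j+1} + O(1)\theta\right]a_0 + \theta^{2\ell}O(1)\frac{M^2}{L^2}.$$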

We choose $\theta = O(1) < 1/2$ to be a sufficiently small constant, so that the term $O(1)\theta$ in the
brackets above is at most 0.5. Then we can choose $j = O(1)\kappa^2$ so that the entire coefficient in the
brackets is at most 0.75. Finally, we set $\ell = O(1)\log\left(\frac{M^2}{L^2\epsilon}\right)$ so that the last term is smaller than
$\epsilon/8$. Let $y_k$ be the iterate after the $k$-th epoch and $A_k = \mathbb{E}\|y_k - x^*\|^2$. Therefore, KroMagnon
satisfies the recursion

$$A_{k+1} \le 0.75 \cdot A_k + \frac{\epsilon}{8} \le (0.75)^{k+1}A_0 + \frac{\epsilon}{2}.$$

This implies that $O(1)\log(a_0/\epsilon)$ epochs are sufficient to reach $\epsilon$ accuracy, where $a_0$ is $\|x_0 - x^*\|^2$,
the initial distance squared to the optimum.
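The epoch count implied by the recursion $A_{k+1} \le 0.75 \cdot A_k + \epsilon/8$ can be sketched by iterating it with equality until the target accuracy is reached. The function name is hypothetical; the loop terminates for any $\epsilon > 0$ because the fixed point of the recursion is $\epsilon/2$.

```python
import math

def epochs_to_accuracy(A0, eps):
    """Number of epochs until the worst-case recursion A <- 0.75 * A + eps / 8 drops to eps."""
    A, k = A0, 0
    while A > eps:
        A = 0.75 * A + eps / 8.0
        k += 1
    return k

A0, eps = 1.0, 1e-3
k = epochs_to_accuracy(A0, eps)
# Unrolling gives A_k <= 0.75**k * A0 + eps / 2, so log(2 * A0 / eps) / log(4 / 3) epochs suffice.
bound = math.ceil(math.log(2.0 * A0 / eps) / math.log(4.0 / 3.0))
```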

Problem               Dataset                        # data     # features   sparsity
Linear regression     Synthetic                      3M         10K          20
Logistic regression   Synthetic                      3M         10K          20
                      rcv1 [31] from [32]            ≈ 700K     ≈ 42K        ≈ 73
                      url [33] from [32]             ≈ 3.2M     ≈ 2.4M       ≈ 116
Vertex cover*         eswiki-2013 [34–36]            ≈ 970K     ≈ 23M        ≈ 24
                      wordassociation-2011 [34–36]   ≈ 10.6K    ≈ 72K        ≈ 7

Table 1: The problems and data sets used in our experimental evaluation. We test KroMagnon, dense
SVRG, and Hogwild! on three different tasks: linear regression, logistic regression, and vertex cover. We
test the algorithms on sparse data sets, of various sizes and feature dimensions.

*Following [5], we optimize a quadratic penalty relaxation for vertex cover.

6 Empirical Evaluation of KroMagnon 

In this section we evaluate KroMagnon empirically. Our two goals are to demonstrate that
(1) KroMagnon is faster than dense SVRG, and (2) KroMagnon has speedups comparable to
those of Hogwild!. We implemented Hogwild!, asynchronous dense SVRG, and KroMagnon
in Scala, and tested them on the problems and datasets listed in Table 1. Each algorithm was run
for 50 epochs, using up to 16 threads. For the SVRG algorithms, we recompute $y$ and the full
gradient $\nabla f(y)$ every two epochs. We normalize the objective values such that the objective at
the initial starting point has a value of one, and the minimum attained across all algorithms and
epochs has a value of zero. Experiments were conducted on a Linux machine with two Intel Xeon
E5-2670 processors (2.60 GHz, eight cores each) and 250 GB of memory.

Comparison with dense SVRG. We were unable to run dense SVRG on the url and eswiki-2013
datasets due to the large number of features. Figures 3a and 5a show that KroMagnon is
one to two orders of magnitude faster than dense SVRG. In fact, running dense SVRG on 16 threads
is slower than KroMagnon on a single thread. Moreover, as seen in Fig. 4a, KroMagnon on 16
threads can be up to four orders of magnitude faster than serial dense SVRG. Both dense SVRG
and KroMagnon attain similar optima.

Speedups. We measured the time each algorithm takes to achieve 99.9% and 99.99% of the
minimum achieved by that algorithm. Speedups are computed relative to the runtime of the
algorithm on a single thread. Although the speedup of KroMagnon varies across datasets, we
find that KroMagnon has comparable speedups with Hogwild! on all datasets, as shown in
Figures 3c, 3d, 4c, 4d, 5c, and 5d. We further observe that dense SVRG has better speedup scaling.
This happens because the per-iteration cost of Hogwild! and KroMagnon is so much
cheaper that the additional overhead associated with having extra threads leads to
some speedup loss; this is not the case for dense SVRG, as its per-iteration cost is higher.
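The two reporting conventions described in this section, objective normalization and speedup relative to a single thread, can be sketched as follows. The function names and all numbers below are hypothetical placeholders, not measurements from the paper.

```python
def normalize_objective(values, initial, best):
    """Rescale objective values so that the starting point maps to 1 and the best value to 0."""
    return [(v - best) / (initial - best) for v in values]

def speedup(runtime_by_threads):
    """Speedup of each thread count relative to the single-thread runtime."""
    base = runtime_by_threads[1]
    return {threads: base / t for threads, t in runtime_by_threads.items()}

# Hypothetical objective trace and runtimes (seconds), for illustration only.
normalized = normalize_objective([10.0, 6.0, 2.0], initial=10.0, best=2.0)
speedups = speedup({1: 8.0, 2: 4.0, 4: 2.5})
```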

7 Conclusions and Open Problems 

We have introduced a novel framework for analyzing parallel asynchronous stochastic gradient
optimization algorithms. The main advantage of our framework is that it is straightforward to apply
to a range of first-order stochastic algorithms, while it involves elementary derivations. Moreover,
in our analysis we lift, or relax, many of the assumptions made in prior art, e.g., we do not assume
consistent reads, and we analyze full stochastic gradient updates. We use our framework to analyze
Hogwild! and ASCD, and further introduce and analyze KroMagnon, a new asynchronous
sparse SVRG algorithm.

[Figure 3 panels: (a) Linear regression, synthetic; (b) Logistic regression, synthetic; (c), (d) speedups]

Figure 3: Linear and logistic regression on synthetic data. In subfigures (a) and (b) we plot the convergence
with respect to normalized objective value as a function of wall-clock time, and in (c) and (d) the speedup
with respect to the number of threads. The above experiments are all for linear and logistic regression
problems on synthetic data, in which we have 3 million data points, each with 10K features, and each
data point with 20 nonzero entries. We observe that KroMagnon is significantly faster than parallel and
dense SVRG, while they both can attain better objective values compared to constant step-size Hogwild!.
Moreover, we observe that the speedup gains of Hogwild! and KroMagnon are scaling reasonably well
for up to 16 threads.


We conclude with some open problems: 

1. It would be interesting to obtain tighter bounds for the convergence of function values of 
the algorithms presented. How do the “errors” due to asynchrony influence the convergence 
rate of function values? In this case the number of iterations required to reach a target 
accuracy should scale with the condition number of the objective, not its square. More¬ 
over, the literature on stochastic coordinate descent establishes convergence results in terms 
of coordinate-wise Lipschitz constants—a more refined smoothness quantity than the full- 
function smoothness. It would be worthwhile to know if our framework can be adapted to 
take these parameters into account. 

2. Our perturbed iterates framework relies fundamentally on the strong convexity assumption.
However, asynchronous algorithms are known to perform well on non-strongly convex (and
even nonconvex) objectives. Can we generalize our framework to simply convex, or smooth
functions? Under what assumptions, or simple families of functions, can we show convergence
for nonconvex problems?

[Figure 4 panels: (a) Vertex cover, wordassociation-2011; (b) Vertex cover, eswiki-2013; (c), (d) speedups]

Figure 4: Vertex cover on the wordassociation-2011 and eswiki-2013 datasets. Subfigure (a) shows the
convergence of the algorithms on wordassociation-2011, a small graph with fewer than 11,000 vertices.
KroMagnon on a single thread is 3-4 orders of magnitude faster than dense SVRG on this dataset. Convergence
of KroMagnon and Hogwild! on the eswiki-2013 dataset is shown in subfigure (b); we were unable to run
dense SVRG on this larger graph. Subfigures (c) and (d) show the speedups of the algorithms on the two
datasets. In subfigure (c), both Hogwild! and KroMagnon exhibit poorer speedups than dense SVRG
because of the rapid convergence on the smaller wordassociation-2011 dataset. In subfigure (d) we observe that
Hogwild! achieves a speedup of up to 8x and KroMagnon up to 5x.

3. As previously explained, we believe that the upper bounds on r—the proxy for the number 
of cores—in our ASCD and KroMagnon analyses are amenable to improvements. It is an 
open problem to explore the extent of such improvements. 

4. Our analysis offers sensible upper bounds only in the presence of sparsity. It seems, however,
that to obtain speedup results for Hogwild!, it is only necessary to have small correlation
between randomly sampled gradients. In what practical setups do randomly selected gradients
have sufficiently small correlation? Does that immediately imply linear speedups in the same
way that update sparsity does?

[Figure 5 panels: (a) Logistic regression, rcv1; (b) Logistic regression, url; (c), (d) speedups]

Figure 5: Logistic regression on the rcv1 and url datasets. Subfigure (a) shows the convergence of the
algorithms on the rcv1 dataset. For a given objective value, KroMagnon is 1-2 orders of magnitude faster
than dense SVRG. On the larger url dataset (subfigure (b)), we were unable to run dense SVRG. Note
that some of the effect of asynchrony can be observed in these experiments: the objective values obtained
by KroMagnon, Hogwild!, and dense SVRG are slightly different on 1 thread compared to 16 threads.
Speedups of the algorithms are shown in subfigures (c) and (d). KroMagnon has a slightly better speedup
than Hogwild! on rcv1, and the same speedup on url.

5. In this work we analyzed three similar stochastic first-order methods. It is an open problem 
to apply our framework and provide an elementary analysis for a greater variety of stochastic 
gradient type optimization algorithms, such as AdaGrad-type schemes (similar to [6]), or 
stochastic dual coordinate methods (similar to [8]). 

6. Capturing the effects of asynchrony as noise on the algorithmic input seems to be applicable
to settings beyond stochastic optimization. As shown recently for a combinatorial graph
problem, a similar viewpoint enables the analysis of an asynchronous graph clustering
algorithm [37]. It is an interesting endeavor to explore the extent to which a perturbed iterate
viewpoint is suitable for analyzing general asynchronous iterative algorithms.

A Removing the Independence Assumption

Our main analysis for Hogwild! relied on the independence between $\hat{x}_j$ and $s_j$ (Assumption 1).
Here, we show how to lift this assumption, conditional on the fact that each of the $f_{e_i}$ terms is at least
$L$-smooth. Observe that the following is no longer true: $\mathbb{E}\langle \hat{x}_j - x^*, g(\hat{x}_j, s_j)\rangle = \mathbb{E}\langle \hat{x}_j - x^*, \nabla f(\hat{x}_j)\rangle$,
since we cannot use iterated expectations, precisely due to the lack of independence of the samples
and the read variables. However, $x_j$, defined in § 3, is still independent of $s_j$. Therefore, we expand
our derivation in (2.3) in the following way:

ttj+i < aj - 27 E(xj -x*,g{xj,Sj)) -k 7^IE||5(xj, ^j)f + 27 E(xj - Xj, gfxj, f,j)) 

+ 27E(xj - X*, g(xj, Sj) - g{xj,Sj)). 

Since Xj and Sj are independent by construction, we use iterated expectations to get: E(xj — 
X* ,g(xj, Sj)) = E(xj — X*, V/(xj)). As before, the strong convexity of / and the triangle inequality 
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imply that {xj — x*, V/(xj)) > y||xj — x*|p — m\\xj — Xjp. Putting everything together we get 
the following recursion for the sequence aj. 

aj+i < (l-7m)aj ++2-fmE\\xj - xjf +2jE{xj - Xj, g{xj, ^j)) 

' -V-' '-V-' '-V-' 

r{ r{ 

+ 27E(xj - x*,g(xj,Sj) - g{xj,Sj)). 

'• -V-' 

Ri 


The reader can verify that although and i ?2 defined now in terms of Xj, the upper bounds 
derived in § [3] still hold. We bound as follows 


-X*, g(xj, Sj) - g(%, Sj))<E||xj- - x*|| ||g(xj, Sj) - g(xj, Sj)|| < 


maj L 


H—^E||xj-Xj||" 

4 m 


where the last inequality follows by the smoothness of the gradient steps and the arithmetic- 
geometric mean inequality. Therefore 


aj+i < (l - aj + j'^ E\\g{xj,^j)\\'^ +2-fm E\\xj - Xj-f^ -t 27 E(x^- 


+ 27 ^E||x,-x,f. 
in ^ ^ ^ 

Ri 


Xj,g(Xj,^j)) 

Ri 


Since we can upper bound by the same bound derived for i^g, we obtain the following convergence 
result for Hogwild!: 


Theorem 19. If the number of samples that overlap in time with a single sample during the 
execution of Hogwild! is hounded as t = O (min {^/Ac, then Hogwild! with step-size 
7 = 0(1)^, reaches an accuracy o/E||xfc — x*|p < e after T > 0{l) ^ iterations. 

The only difference between this result and the one in our main section is that we can guarantee 
speedup for a smaller range of r. Similar ideas could be applied to the analysis of ASCD and 
KroMagnon. 


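B Omitted Proofs

B.1 ASCD

B.1.1 Bounding $R_0^j$ and $R_1^j$

Lemma 8. Let $\tau \le \kappa\sqrt{d}/\ell$ and set $\gamma = O(1)\frac{\theta}{dL\kappa}$, for any $\theta \le 1$ and $\ell \ge 1$. Then,

$$R_0^j \le O(1)\left(dL^2 a_j + \theta^{2\ell} dM^2\right) \quad \text{and} \quad R_1^j \le O(1)\left(\theta^2 a_j + \theta^{2\ell}\frac{M^2}{L^2}\right).$$

Proof. Let $A = 2dL^2a_j$, $B = 2dL^2$ and $C = (\gamma\tau)^2$. Then, we can rewrite the bounds of Lemma 7
as $G_r \le A + B \cdot \Delta_r$ and $\Delta_r \le 3^2(r + 1)^2 \cdot C \cdot G_{r+1}$, which implies $G_r \le A + 3^2(r + 1)^2 BC \cdot G_{r+1}$.
We can now upper bound $R_0^j = G_0$ by applying the previous inequality $\ell$ times. If we expand the
formulas, we get

$$R_0^j = G_0 \le A\sum_{i=0}^{\ell-1}(3^i \cdot i!)^2(BC)^i + (3^\ell \cdot \ell!)^2(BC)^\ell G_\ell. \tag{B.1}$$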
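The expansion in (B.1) can be checked mechanically: unrolling $G_r \le A + 3^2(r+1)^2 BC \cdot G_{r+1}$ with equality, from $r = \ell - 1$ down to $r = 0$, should reproduce the closed form exactly. The sketch below does this in exact rational arithmetic; the function names and the test values are arbitrary assumptions.

```python
import math
from fractions import Fraction

def unrolled_bound(A, BC, G_ell, ell):
    """Apply G_r = A + 9 * (r + 1)**2 * BC * G_{r+1} backwards from r = ell - 1 to r = 0."""
    G = G_ell
    for r in reversed(range(ell)):
        G = A + 9 * (r + 1) ** 2 * BC * G
    return G

def closed_form(A, BC, G_ell, ell):
    """Right-hand side of (B.1): a weighted sum of A plus a tail term in G_ell."""
    head = A * sum((3 ** i * math.factorial(i)) ** 2 * BC ** i for i in range(ell))
    tail = (3 ** ell * math.factorial(ell)) ** 2 * BC ** ell * G_ell
    return head + tail

A, BC, G_ell, ell = Fraction(2), Fraction(1, 50), Fraction(7), 5   # arbitrary test values
lhs = unrolled_bound(A, BC, G_ell, ell)
rhs = closed_form(A, BC, G_ell, ell)
```

The coefficient of $A$ at depth $i$ is the product $\prod_{r=0}^{i-1} 9(r+1)^2 BC = (3^i \cdot i!)^2 (BC)^i$, which is why the factorials appear in the sum.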
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Since 7 = and r < (the choice of r is made so that the sum in (jB.ip is significantly 
small), we have BC = 2dL‘^'y‘^T^ < 2dL^ ^ < 2 ^^^' Using the upper bound k\ < on 

each term of the sum (jB.ip . and plugging in the upper bound on BC, we get 




i=0 


i=0 


i=0 


Similarly, we obtain the following upper bound on the last term of Eq. IB.II (3^ • £\)‘^{BCY < 2 
Finally, Gi < dM‘^, and combining the above gives us < 0(1) [dlP'aj + O'^^dM'^) . 

We can now bound $R_1^j$. By definition $R_1^j = \Delta_0$, and from Lemma 7 we have $\Delta_0 \le 3^2 \cdot C \cdot G_1$.
We can bound $G_1$ similarly to $G_0$ as

$$G_1 \le A\sum_{i=0}^{\ell-1}(3^i \cdot (i + 1)!)^2(BC)^i + (3^\ell \cdot (\ell + 1)!)^2(BC)^\ell G_{\ell+1}.$$

As before BC < 2 ^ 2 ^ ■ Since (i + 1)! < 2G for any 0 < i < £, it follows that • (* + 

l)!)^(i3G)* < 0(1), and (3^ • (£+1)!)^(BG)^ < 0(1)0^^. Therefore, because Ge+i < dM'^, we obtain 
Gi < 0{l){dL‘^aj+6‘^^dM‘^). Since G = (yr)^ < it follows that Ri < 0(1) {0‘^aj + 9‘^^{M/L)‘^). 

□ 


B.1.2 Bounding $R_2^j$

Lemma 9. Let $\tau \le \kappa\sqrt{d}/\ell$ and $\tau = O(\sqrt[6]{d})$. Then, $R_2^j \le O(1)\left(\theta m a_j + \theta^{2\ell}\frac{M^2}{L\kappa}\right)$.

Proof. From (4.1) we can upper bound $R_2^j$ as follows:

$$R_2^j = \mathbb{E}\langle \hat{x}_j - x_j, g(\hat{x}_j, s_j)\rangle \le \gamma \cdot \sum_{\substack{i = j - \tau \\ i \ne j}}^{j + \tau} \mathbb{E}\left\{\|g(\hat{x}_i, s_i)\| \cdot \|g(\hat{x}_j, s_j)\| \cdot \mathbb{1}(s_i = s_j)\right\}. \tag{B.2}$$

The random variable $\mathbb{1}(s_i = s_j)$ encodes the sparsity of the gradient steps. To take advantage of
this sparsity we use smoothness to replace the iterates $\hat{x}_i$ and $\hat{x}_j$ by $x_{j-3\tau}$. The latter iterate is
independent of both $s_i$ and $s_j$ by our assumption that no more than $\tau$ coordinates can be updated
while a core is processing a single coordinate. This independence will allow us to "untangle"
the expectation of $\mathbb{1}(s_i = s_j)$ from the inner products in the above sum, which will result in a
significantly improved bound on $R_2^j$ compared to applying Cauchy-Schwarz directly on it.

For clarity, we note that when $j < 3\tau$, we have $x_{j-3\tau} = x_0$. From the $L$-Lipschitz assumption
on the gradient $\nabla f(x)$, we get the following bounds

$$\|g(\hat{x}_j, s_j)\| \le \|g(x_{j-3\tau}, s_j)\| + dL\|x_{j-3\tau} - \hat{x}_j\|,$$
$$\|g(\hat{x}_i, s_i)\| \le \|g(x_{j-3\tau}, s_i)\| + dL\|x_{j-3\tau} - \hat{x}_i\|.$$

Then, the expectation of a term $\|g(\hat{x}_i, s_i)\| \cdot \|g(\hat{x}_j, s_j)\| \cdot \mathbb{1}(s_i = s_j)$ in the sum (B.2) is upper
bounded by the expectation of

$$\Big(\|g(x_{j-3\tau}, s_i)\|\,\|g(x_{j-3\tau}, s_j)\| + (dL)^2\|x_{j-3\tau} - \hat{x}_i\|\,\|x_{j-3\tau} - \hat{x}_j\| + dL\|g(x_{j-3\tau}, s_j)\|\,\|x_{j-3\tau} - \hat{x}_i\| + dL\|g(x_{j-3\tau}, s_i)\|\,\|x_{j-3\tau} - \hat{x}_j\|\Big) \cdot \mathbb{1}(s_i = s_j).$$
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We first bound the second term using Cauchy-Schwartz and the property of iterated expectation, 
to exploit the expectation of the l(sj = Sj) term 

IE{||Xj_ 3^ - Xj|| • ||Xj-3t - %|| -1(51 = Sj)} 

< ^IE{||xj_ 3^ - XiP} • E{||xj_ 3^ - XjP • l(si = Sj)} 

= Y^lE{||ij-3r -Xi||2} .E^^.{||ij_3^ = Sj)}} 

= y^Y^lE{||xj_3r -iiP} •E{||ij-3r -ijP} 

V a kesi 

'-V-' 

G4 

where the first equality follows because $\hat{x}_j$ is independent of $s_j$; hence the expectation with respect to $s_j$ can be applied to the indicator function. The last inequality follows from our arguments in the proof of Lemma 7 because both mismatches $x_{j-3\tau} - \hat{x}_i$ and $x_{j-3\tau} - \hat{x}_j$ can be written as linear combinations of gradient steps, as in (4.1). Similarly the third term satisfies the inequality

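\[
\mathbb{E}\big\{ \|g(x_{j-3\tau}, s_j)\| \cdot \|x_{j-3\tau} - \hat{x}_i\| \cdot \mathbb{1}(s_i = s_j) \big\}
\]
\[
\le \sqrt{ \mathbb{E}\big\{ \|g(x_{j-3\tau}, s_j)\|^2 \big\} \cdot \mathbb{E}\big\{ \|x_{j-3\tau} - \hat{x}_i\|^2 \cdot \mathbb{1}(s_i = s_j) \big\} }
\]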

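The same bound applies for the fourth term $\mathbb{E}\{ \|g(x_{j-3\tau}, s_i)\| \cdot \|x_{j-3\tau} - \hat{x}_j\| \cdot \mathbb{1}(s_i = s_j) \}$, while the first term can be easily bounded as

\[
\mathbb{E}\big\{ \|g(x_{j-3\tau}, s_i)\| \cdot \|g(x_{j-3\tau}, s_j)\| \cdot \mathbb{1}(s_i = s_j) \big\} \le
\]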

Putting all pieces together, and using the prescribed value of $\gamma$, we have that

R 2 < (1 + dL-fT + {dLjr)'^) G 4 < C>( 1 )y^- 7 • • G 4 . 

The first inequality follows because we are summing over $2\tau$ terms in (B.2). To see why the second inequality is true, note that $dL\gamma \le 1$ (it is always true that the condition number $\kappa \ge 1$). Therefore $1 + dL\gamma\tau + (dL\gamma\tau)^2 \le 1 + \tau + \tau^2 \le 3\tau^2$. As in the proof of an earlier lemma, we can bound $G_4$ by

\[
G_4 \le O(1)\, A \sum_{i=0}^{\ell-1} \big(3^i \cdot (i+4)!\big)^2 (BC)^i + O(1)\, \big(3^\ell \cdot (\ell+4)!\big)^2 (BC)^\ell\, G_{\ell+4}
\]

< 0{l){dL‘^aj + e^^dM^). 

The result follows assuming r = 0{^fd) and 7 = □ 

Remark 4. We believe that if we use the same bounding technique that we applied for $R_2^j$ on $R_0^j$ and $R_1^j$, then we can significantly improve the restrictive bound on $\tau$.
B.2 KroMagnon

B.2.1 Bounding $R_0^j$ and $R_1^j$

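Lemma 17. Suppose $\tau \le \kappa/\ell$ and $\gamma = \frac{\theta}{12L\kappa}$. Then the error terms $R_0^j$ and $R_1^j$ of KroMagnon satisfy the following inequalities:

\[
R_0^j \le O(1) \left( L^2 (a_j + a_0) + \theta^{2\ell} M^2 \right) \quad \text{and} \quad R_1^j \le O(1) \left( \theta^2 (a_j + a_0) + \theta^{2\ell} \frac{M^2}{L^2} \right).
\]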

Proof. Let $A = 4L^2 (a_j + a_0)$, $B = 4L^2$, and $C = (\gamma\tau)^2$. Then, the inequalities derived above can be rewritten as

\[
G_r \le A + B \Delta_r \quad \text{and} \quad \Delta_r \le 3^2 (r+1)^2\, C\, G_{r+1}. \tag{B.3}
\]

If we expand the formulas, we get for $R_0^j$ the following upper bound

\[
R_0^j = G_0 \le A \sum_{i=0}^{\ell-1} (3^i \cdot i!)^2 (BC)^i + (3^\ell \cdot \ell!)^2 (BC)^\ell\, G_\ell.
\]
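The expansion of the recursion (B.3) can be verified mechanically. The sketch below treats both inequalities in (B.3) as equalities, unrolls them from index $\ell$ down to $0$, and compares the result with the closed-form sum. The function names and the numerical values are hypothetical choices made only for this check.

```python
from fractions import Fraction
from math import factorial


def unroll(A, B, C, g_ell, ell):
    """Apply G_r = A + B * Delta_r with Delta_r = 9 (r+1)^2 C G_{r+1}, for r = ell-1, ..., 0."""
    G = Fraction(g_ell)
    for r in range(ell - 1, -1, -1):
        delta_r = 9 * (r + 1) ** 2 * C * G
        G = A + B * delta_r
    return G


def closed_form(A, B, C, g_ell, ell):
    """A * sum_{i < ell} (3^i i!)^2 (BC)^i + (3^ell ell!)^2 (BC)^ell * G_ell."""
    def coeff(i):
        return (3 ** i * factorial(i)) ** 2 * (B * C) ** i

    return A * sum(coeff(i) for i in range(ell)) + coeff(ell) * g_ell


# Hypothetical values, chosen only to exercise the identity.
A, B, C = Fraction(5), Fraction(4), Fraction(1, 1000)
g_ell, ell = Fraction(7), 5
print(unroll(A, B, C, g_ell, ell) == closed_form(A, B, C, g_ell, ell))
```

The coefficient of $A$ at depth $i$ is the product of the factors $9(r+1)^2 BC$ for $r = 0, \dots, i-1$, which is where the $(3^i \cdot i!)^2 (BC)^i$ terms come from.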


We chose $\gamma = \frac{\theta}{12L\kappa}$ and assumed that $\tau \le \kappa/\ell$, where $\kappa$ is the condition number and $\theta \le 1$. We chose $\gamma$ to be proportional to the step-size of the serial SVRG, and the assumption on $\tau$ is made so that the sum in the above inequality is sufficiently small. Then,


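\[
(3^i \cdot i!)^2 (BC)^i \le (3\ell)^{2i} \left( \frac{4L^2 \theta^2 \kappa^2}{4^2 \cdot 3^2 L^2 \kappa^2 \ell^2} \right)^{i} = \left( \frac{\theta^2}{4} \right)^{i},
\]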


and hence

\[
\sum_{i=0}^{\ell-1} (3^i \cdot i!)^2 (BC)^i \le \sum_{i=0}^{\infty} 2^{-2i} \le 2.
\]
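The size of this sum can be checked numerically. The sketch below assumes the step size $\gamma = \theta/(12L\kappa)$ and the largest admissible delay $\tau = \kappa/\ell$, with $B = 4L^2$ and $C = (\gamma\tau)^2$. The function name and the sample values of $\theta$, $L$, $\kappa$, and $\ell$ are hypothetical and serve only as an illustration.

```python
from fractions import Fraction
from math import factorial


def series_terms(theta, L, kappa, ell):
    """Terms (3^i i!)^2 (BC)^i for i < ell, with B = 4 L^2 and C = (gamma * tau)^2."""
    gamma = theta / (12 * L * kappa)  # assumed step size
    tau = Fraction(kappa) / ell       # largest delay allowed by tau <= kappa / ell
    BC = 4 * L ** 2 * (gamma * tau) ** 2
    return [(3 ** i * factorial(i)) ** 2 * BC ** i for i in range(ell)]


# Hypothetical sample values with theta <= 1.
theta, L, kappa, ell = Fraction(1), Fraction(3), Fraction(50), 10
terms = series_terms(theta, L, kappa, ell)

# Each term is dominated by (theta^2 / 4)^i, so the sum stays below 2.
dominated = all(t <= (theta ** 2 / 4) ** i for i, t in enumerate(terms))
print(dominated, float(sum(terms)))
```

With these assumptions $BC = \theta^2/(36\ell^2)$, so the factorial growth of $(3^i \cdot i!)^2$ is offset by the $\ell^{-2i}$ decay for every $i < \ell$.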

As in the previous sections we assume a uniform upper bound $M > 0$ on the size of the gradient steps: $\max_j \mathbb{E}\|v(\hat{x}_j, s_j)\|^2 = M^2$. Therefore

\[
R_0^j = G_0 \le O(1) \left( L^2 (a_j + a_0) + \theta^{2\ell} M^2 \right).
\]

After an analogous derivation one can see that

\[
R_1^j = \Delta_0 \le O(1) \left( \theta^2 (a_j + a_0) + \theta^{2\ell} \frac{M^2}{L^2} \right)
\]

and thus we obtain the result. 


□ 


B.2.2 Bounding $R_2^j$

Lemma 18. Suppose t < j and t = O 




2 £M 2 ^ 


Ri < 0(1) {9-m-{aj+ ao) + 9^^— . 

Ijtxi J 
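Proof. From (5.5) we can upper bound $R_2^j$ as follows.

\[
R_2^j = \mathbb{E}\langle \hat{x}_j - x_j,\, v(\hat{x}_j, s_j)\rangle \le \gamma \cdot \sum_{\substack{i=j-\tau \\ i \ne j}}^{j+\tau} \mathbb{E}\,\|v(\hat{x}_i, s_i)\| \cdot \|v(\hat{x}_j, s_j)\| \cdot \mathbb{1}(s_i \cap s_j \ne \emptyset). \tag{B.4}
\]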

The random variable $\mathbb{1}(s_i \cap s_j \ne \emptyset)$ encodes the sparsity of the gradient steps. As in the proof of Lemma 9, we replace $\hat{x}_i$ and $\hat{x}_j$ in the above sum by $x_{j-3\tau}$. When $j < 3\tau$ we define $x_{j-3\tau} = x_0$. Since the $f_{e_i}$ are $L$-smooth, we have

\[
\|v(\hat{x}_j, s_j)\| \le \|v(x_{j-3\tau}, s_j)\| + L\,\|x_{j-3\tau} - \hat{x}_j\|
\]

\[
\|v(\hat{x}_i, s_i)\| \le \|v(x_{j-3\tau}, s_i)\| + L\,\|x_{j-3\tau} - \hat{x}_i\|.
\]

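Then, the expectation of a term $\|v(\hat{x}_i, s_i)\| \cdot \|v(\hat{x}_j, s_j)\| \cdot \mathbb{1}(s_i \cap s_j \ne \emptyset)$ in the sum (B.4) is upper bounded by

\[
\mathbb{E}\Big\{ \Big( \|v(x_{j-3\tau}, s_i)\|\, \|v(x_{j-3\tau}, s_j)\| + L^2 \|x_{j-3\tau} - \hat{x}_i\|\, \|x_{j-3\tau} - \hat{x}_j\|
\]
\[
\qquad + L\, \|v(x_{j-3\tau}, s_j)\|\, \|x_{j-3\tau} - \hat{x}_i\| + L\, \|v(x_{j-3\tau}, s_i)\|\, \|x_{j-3\tau} - \hat{x}_j\| \Big) \cdot \mathbb{1}(s_i \cap s_j \ne \emptyset) \Big\}
\]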


Then, since El(si n Sj 7 ^ 0) < (recall that Ac is the average conflict degrees), can be shown 
to satisfy the inequality 


Ri < 0(l)y + ao) + 

as in the proof of Lemma El The conclusion follows because t = O 

□ 


yt) and 7 = = 


mO 

1217' 


Remark 5. Similar to ASCD, by using the bounding technique of $R_2^j$ on $R_0^j$ and $R_1^j$, we should be able to significantly improve the restrictive bound on $\tau$ in the convergence result of KroMagnon.